mirror of
https://github.com/facebook/sapling.git
synced 2024-12-25 22:11:52 +03:00
add EdenServer recovery step and recover after failed takeover data send handshake
Summary:
* This adds an `EdenServer::recover()` method to start back up on an unsuccessful takeover data send.
* On an unsuccessful ping, fulfill the `shutdownPromise` with a `TakeoverSendError` containing the constructed `TakeoverData`. After the `recover` function is called, `takeoverPromise_` is reset, `takeoverShutdown` is set to `false`, and the `runningState_` is set to `RUNNING`. By taking over from the returned `TakeoverData`, the user will not encounter `Transport not connected` errors on recovery.
* This adds an `EdenServer::closeStorage()` method to defer closing the `backingStore_` and `localStore_` until after our ready handshake is successful.
* This defers the shutdown of the `PrivHelper` until a successful ready handshake.

I also update the takeover documentation here with the new logic (and fix some formatting issues).

Reviewed By: simpkins
Differential Revision: D20433433
fbshipit-source-id: f59e660922674d281957e80aee5049735b901a2c
parent 8bb3b33f8a
commit 9944a5dff5
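The recovery flow described in the summary can be sketched in standalone C++. This is a minimal illustration under stated assumptions, not EdenFS code: `FakeTakeoverData`, `FakeServer`, and `runUntilCleanShutdown` are hypothetical names, and the shutdown outcomes are scripted rather than produced by a real handshake. It shows the contract this commit appears to introduce: a shutdown attempt yields an empty optional on success and hands the takeover data back on failure, in which case the caller recovers and runs the server again.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for TakeoverData: only the mount paths are modeled.
struct FakeTakeoverData {
  std::vector<std::string> mountPoints;
};

// Hypothetical miniature of the server. Each scripted outcome describes one
// shutdown attempt: std::nullopt means the shutdown completed, a value means
// the takeover handshake failed and the untransmitted data was handed back.
class FakeServer {
 public:
  explicit FakeServer(std::vector<std::optional<FakeTakeoverData>> outcomes)
      : outcomes_(std::move(outcomes)) {}

  void runServer() {
    ++runs_;
  }

  // Returns true once shutdown finished, false if the server recovered and
  // must be run again.
  bool performCleanup() {
    std::optional<FakeTakeoverData> outcome = std::move(outcomes_.at(next_++));
    if (outcome.has_value()) {
      recover(std::move(*outcome));
      return false;
    }
    return true;
  }

  int runs() const {
    return runs_;
  }

  const std::vector<std::string>& remounted() const {
    return remounted_;
  }

 private:
  // Resume serving from the data that was never sent.
  void recover(FakeTakeoverData&& data) {
    for (auto& mount : data.mountPoints) {
      remounted_.push_back(std::move(mount));
    }
  }

  std::vector<std::optional<FakeTakeoverData>> outcomes_;
  std::size_t next_{0};
  int runs_{0};
  std::vector<std::string> remounted_;
};

// Keep running the server until a shutdown attempt actually completes.
int runUntilCleanShutdown(FakeServer& server) {
  while (true) {
    server.runServer();
    if (server.performCleanup()) {
      break;
    }
  }
  return server.runs();
}
```

Returning the data through an optional keeps the failure path explicit: the caller cannot reach the "shutdown finished" branch without first checking whether there is something to recover from.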
@@ -12,33 +12,42 @@ library, client, server, data, and handler.
 
 ### Thrift serialization library
 
-There are two main message classes:
-* `struct TakeoverVersionQuery` - A list of takeover data serialization versions
+There are three main message classes that are exchanged over the takeover socket:
+
+* `struct TakeoverVersionQuery` - A list of takeover data serialization versions
 that the client supports
-* `union SerializedTakeoverData` - A list of `SerializedMountInfo` or a string
+* empty "ready" ping - An empty ping sent by the server to ensure the client is
+still alive and ready to receive takeover data
+* `union SerializedTakeoverData` - A list of `SerializedMountInfo` or a string
 error.
-** `struct SerializedMountInfo` - Contains the mount path, state directory, a
-list of bind mount paths (which is no longer used), connection information, and
-a `SerializedInodeMap`
-** `struct SerializedInodeMapEntry` - contains inode information like
-inodeNumber, parentInode, name, isUnlinked, numFuseReferences, hash, and mode.
-** `struct SerializedInodeMap` - A list of `SerializedInodeMapEntry` unloaded
-inodes
-** `struct SerializedFileHandleMap` - currently empty
+* `struct SerializedMountInfo` - Contains the mount path, state directory, a
+list of bind mount paths (which is no longer used), connection information, and
+a `SerializedInodeMap`
+* `struct SerializedInodeMap` - A list of `SerializedInodeMapEntry` unloaded
+inodes
+* `struct SerializedInodeMapEntry` - contains inode information like
+inodeNumber, parentInode, name, isUnlinked, numFuseReferences, hash,
+and mode.
+* `struct SerializedFileHandleMap` - currently empty
 
 ### Client
 
 The client has one function - `takeoverMounts`. This function requests to take
 over mount points from an existing edenfs process. On success, it returns a
-`TakeoverData` object, and it throws an exception on error. It takes two
-parameters: a socketPath, and a set of integers of supported takeover versions.
+`TakeoverData` object, and it throws an exception on error. It takes three
+parameters: a socketPath, a bool shouldPing, and a set of integers of supported
+takeover versions. The last two parameters are for testing purposes and should
+not be used in production builds.
 
 This has a takeover timeout of 5 minutes for receiving takeover data from the old
 process.
 
 We connect to the socket at the given path, then send our protocol
 version so that the server knows whether we're capable of handshaking
-successfully. We then wait for the takeover data response.
+successfully. We then wait for the server to send us a "ready" ping, making sure
+we are still listening on the socket. We respond to this ping and then wait for
+the takeover data response. It is possible that we will not receive this ping,
+and instead just receive the takeover data response.
 
 After we get the takeover data response, we either throw an exception if we do
 not get a message, or we deserialize the message and check its contents. We
@@ -62,10 +71,15 @@ It has a few functions:
 that the client process is from the same user ID, and that the client and
 server support a compatible takeover protocol version. If the versions are
 compatible, then the server starts to initiate shutdown by calling
-`server_->getTakeoverHandler()->startTakeoverShutdown()` Then, it sends the
-takeover data over the takeover socket by serializing the information
-(version, lock file, thrift socket, mount file descriptor) or error, and
-sending it.
+`server_->getTakeoverHandler()->startTakeoverShutdown()`. After the shutdown
+is completed, the takeover server pings the takeover client to ensure it is
+still waiting for the data. If the ping is unsuccessful (timeout, error, etc.),
+the takeover server stops the takeover process and returns the untransmitted
+`TakeoverData` in an exception in order to let the `EdenServer` recover itself
+and start serving again. Finally, it closes its storage (local and backing stores)
+and sends the takeover data over the takeover socket by serializing the
+information (version, lock file, thrift socket, mount file descriptor) or error,
+and sending it.
 * private functions:
 * `connectionAccepted` - callback function for allocating a connection
 handler when the server gets a client.
@@ -90,11 +104,17 @@ graceful takeover functionality. This is primarily implemented by the
 `EdenServer` class. However, there are also alternative implementations used
 for unit testing.
 
-It has one pure virtual function called `startTakeoverShutdown()`.
-startTakeoverShutdown() will be called when a graceful shutdown has been
+It has two pure virtual functions: `startTakeoverShutdown()` and `closeStorage()`.
+
+`startTakeoverShutdown()` will be called when a graceful shutdown has been
 requested, with a remote process attempting to take over the currently running
 mount points.
 
 When implemented, this should return a Future that will produce the
 `TakeoverData` to send to the remote edenfs process once the edenfs process is
 ready to transfer its mounts.
+
+`closeStorage()` will be called before sending the `TakeoverData` to the client,
+conditionally on a successful ready handshake (if applicable). This function should
+close storage used by the server. In the case of an `EdenServer`, this function
+allows for locks to be released in order for the new process to take over this storage.

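The server-side ordering that the documentation above describes (ping first, hand the data back on failure, close storage only once the client is known to be listening) can be sketched as follows. This is a simplified, synchronous illustration: `FakeTakeoverData`, `HandshakeHooks`, and `pingThenSend` are hypothetical names, and the real code works with futures and a Unix socket rather than plain callbacks.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for TakeoverData.
struct FakeTakeoverData {
  std::vector<std::string> mountPoints;
};

// Hypothetical hooks standing in for the takeover socket and the
// TakeoverHandler implementation.
struct HandshakeHooks {
  std::function<bool()> pingClient; // true if the client answered the ping
  std::function<void()> closeStorage; // releases the store locks
  std::function<void(const FakeTakeoverData&)> sendData;
};

// Returns std::nullopt when the data was sent, or hands the untransmitted
// data back so the caller can recover and keep serving.
std::optional<FakeTakeoverData> pingThenSend(
    FakeTakeoverData data,
    const HandshakeHooks& hooks) {
  if (!hooks.pingClient()) {
    // Storage has not been closed yet, so the old process can resume.
    return data;
  }
  // Only give up the storage locks once the client is known to be listening.
  hooks.closeStorage();
  hooks.sendData(data);
  return std::nullopt;
}
```

Deferring `closeStorage` until after the ping succeeds is what makes recovery cheap in this sketch: on the failure path nothing has been torn down beyond what the returned data already captures.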
@@ -94,6 +94,10 @@ std::string DefaultEdenMain::getLocalHostname() {
   return getHostname();
 }
 
+void DefaultEdenMain::prepare(const EdenServer& /*server*/) {
+  fb303::registerFollyLoggingOptionHandlers();
+}
+
 void DefaultEdenMain::runServer(const EdenServer& server) {
   // ThriftServer::serve() will drive the current thread's EventBase.
   // Verify that we are being called from the expected thread, and will end up
@@ -102,7 +106,6 @@ void DefaultEdenMain::runServer(const EdenServer& server) {
       server.getMainEventBase(),
       folly::EventBaseManager::get()->getEventBase());
 
-  fb303::registerFollyLoggingOptionHandlers();
   fb303::withThriftFunctionStats(kServiceName, server.getHandler().get(), [&] {
     server.getServer()->serve();
   });
@@ -289,8 +292,15 @@ int runEdenMain(EdenMain&& main, int argc, char** argv) {
             DaemonStart{startTimeInSeconds, takeover, true /*success*/});
       });
 
-  main.runServer(server.value());
-  server->performCleanup();
+  main.prepare(server.value());
+  while (true) {
+    main.runServer(server.value());
+    if (server->performCleanup()) {
+      break;
+    }
+    // performCleanup() returns false if a takeover shutdown attempt
+    // failed. Continue and re-run the server in this case.
+  }
 
   XLOG(INFO) << "edenfs exiting successfully";
   return EX_OK;

@@ -25,6 +25,7 @@ class EdenMain {
   virtual std::string getEdenfsBuildName() = 0;
   virtual std::string getEdenfsVersion() = 0;
   virtual std::string getLocalHostname() = 0;
+  virtual void prepare(const EdenServer& server) = 0;
   virtual void runServer(const EdenServer& server) = 0;
 };
 
@@ -36,6 +37,7 @@ class DefaultEdenMain : public EdenMain {
   virtual std::string getEdenfsBuildName() override;
   virtual std::string getEdenfsVersion() override;
   virtual std::string getLocalHostname() override;
+  virtual void prepare(const EdenServer& server) override;
   virtual void runServer(const EdenServer& server) override;
 };
 

@@ -539,6 +539,43 @@ void EdenServer::scheduleInodeUnload(std::chrono::milliseconds timeout) {
       },
       timeout);
 }
+
+Future<Unit> EdenServer::recover(TakeoverData&& data) {
+  return recoverImpl(std::move(data))
+      .ensure(
+          // Mark the server state as RUNNING once we finish setting up the
+          // mount points. Even if an error occurs we still transition to the
+          // running state. Additionally, set the takeoverShutdown state to
+          // false in order to allow for future graceful restart requests.
+          [this] {
+            auto state = runningState_.wlock();
+            state->takeoverShutdown = false;
+            state->takeoverPromise = folly::Promise<TakeoverData>();
+            state->state = RunState::RUNNING;
+          });
+}
+
+Future<Unit> EdenServer::recoverImpl(TakeoverData&& takeoverData) {
+  auto thriftRunningFuture = createThriftServer();
+
+  const auto takeoverPath = edenDir_.getTakeoverSocketPath();
+
+  // Recover the eden lock file and the thrift server socket.
+  edenDir_.takeoverLock(std::move(takeoverData.lockFile));
+  server_->useExistingSocket(takeoverData.thriftSocket.release());
+
+  // Remount our mounts from our prepared takeoverData
+  std::vector<Future<Unit>> mountFutures;
+  mountFutures = prepareMountsTakeover(
+      std::make_unique<ForegroundStartupLogger>(),
+      std::move(takeoverData.mountPoints));
+
+  // Return a future that will complete only when all mount points have
+  // started and the thrift server is also running.
+  mountFutures.emplace_back(std::move(thriftRunningFuture));
+  return folly::collectAllUnsafe(mountFutures).unit();
+}
+
 #endif // !_WIN32
 
 Future<Unit> EdenServer::prepare(
@@ -841,9 +878,35 @@ void EdenServer::incrementStartupMountFailures() {
   fb303::fbData->incrementCounter("startup_mount_failures");
 }
 
-void EdenServer::performCleanup() {
+void EdenServer::closeStorage() {
+  // Destroy the local store and backing stores.
+  // We shouldn't access the local store any more after giving up our
+  // lock, and we need to close it to release its lock before the new
+  // edenfs process tries to open it.
+  backingStores_.wlock()->clear();
+
+  // Explicitly close the LocalStore
+  // Since we have a shared_ptr to it, other parts of the code can
+  // theoretically still maintain a reference to it after the EdenServer is
+  // destroyed. We want to ensure that it is really closed and no subsequent
+  // I/O can happen to it after the EdenServer is shut down and the main Eden
+  // lock is released.
+  localStore_->close();
+}
+
+bool EdenServer::performCleanup() {
+  bool takeover = false;
 #ifndef _WIN32
-  bool takeover;
+  folly::stop_watch<> shutdown;
+  bool shutdownSuccess = true;
+  SCOPE_EXIT {
+    auto shutdownTimeInSeconds =
+        std::chrono::duration<double>{shutdown.elapsed()}.count();
+    serverState_->getStructuredLogger()->logEvent(
+        DaemonStop{shutdownTimeInSeconds, takeover, shutdownSuccess});
+  };
+#endif
+
   folly::File thriftSocket;
   {
     auto state = runningState_.wlock();
@@ -853,82 +916,70 @@ void EdenServer::performCleanup() {
     }
     state->state = RunState::SHUTTING_DOWN;
   }
-  folly::stop_watch<> shutdown;
   auto shutdownFuture = takeover
       ? performTakeoverShutdown(std::move(thriftSocket))
-      : performNormalShutdown();
-#else
-  auto shutdownFuture = performNormalShutdown();
-#endif
+      : performNormalShutdown().thenValue([](auto&&) { return std::nullopt; });
 
   // Drive the main event base until shutdownFuture completes
   CHECK_EQ(mainEventBase_, folly::EventBaseManager::get()->getEventBase());
   while (!shutdownFuture.isReady()) {
     mainEventBase_->loopOnce();
   }
+  auto&& shutdownResult = shutdownFuture.getTry();
 #ifndef _WIN32
-  std::move(shutdownFuture)
-      .thenTry([shutdown,
-                takeover,
-                structuredLogger = serverState_->getStructuredLogger()](
-                   folly::Try<Unit>&& result) {
-        auto shutdownTimeInSeconds =
-            std::chrono::duration<double>{shutdown.elapsed()}.count();
-        structuredLogger->logEvent(DaemonStop{
-            shutdownTimeInSeconds, takeover, !result.hasException()});
-      })
-      .get();
-#else
-  std::move(shutdownFuture).get();
+  shutdownSuccess = !shutdownResult.hasException();
+
+  // We must check if the shutdownResult contains TakeoverData, and if so
+  // we must recover
+  if (shutdownResult.hasValue()) {
+    auto&& shutdownValue = shutdownResult.value();
+    if (shutdownValue.has_value()) {
+      // shutdownValue only contains a value if a takeover was not successful.
+      shutdownSuccess = false;
+      XLOG(INFO)
+          << "edenfs encountered a takeover error, attempting to recover";
+      // We do not wait here for the remounts to succeed, and instead will
+      // let runServer() drive the mainEventBase loop to finish this call
+      (void)recover(std::move(shutdownValue).value());
+      return false;
+    }
+  }
 #endif
 
-  // Explicitly close the LocalStore
-  // Since we have a shared_ptr to it, other parts of the code can theoretically
-  // still maintain a reference to it after the EdenServer is destroyed.
-  // We want to ensure that it is really closed and no subsequent I/O can happen
-  // to it after the EdenServer is shut down and the main Eden lock is released.
-  localStore_->close();
+  closeStorage();
+  // Stop the privhelper process.
+  shutdownPrivhelper();
+  shutdownResult.throwIfFailed();
+
+  return true;
 }
 
+Future<optional<TakeoverData>> EdenServer::performTakeoverShutdown(
+    folly::File thriftSocket) {
 #ifndef _WIN32
-Future<Unit> EdenServer::performTakeoverShutdown(folly::File thriftSocket) {
   // stop processing new FUSE requests for the mounts,
   return stopMountsForTakeover().thenValue(
       [this,
        socket = std::move(thriftSocket)](TakeoverData&& takeover) mutable {
-        // Destroy the local store and backing stores.
-        // We shouldn't access the local store any more after giving up our
-        // lock, and we need to close it to release its lock before the new
-        // edenfs process tries to open it.
-        backingStores_.wlock()->clear();
-        // Explicit close the LocalStore to ensure we release the RocksDB lock.
-        // Note that simply resetting the localStore_ pointer is insufficient,
-        // since there may still be other outstanding reference counts to the
-        // object.
-        localStore_->close();
-
-        // Stop the privhelper process.
-        shutdownPrivhelper();
-
         takeover.lockFile = edenDir_.extractLock();
         auto future = takeover.takeoverComplete.getFuture();
         takeover.thriftSocket = std::move(socket);
 
-        takeoverPromise_.setValue(std::move(takeover));
+        runningState_.wlock()->takeoverPromise.setValue(std::move(takeover));
         return future;
       });
-}
+#else
+  NOT_IMPLEMENTED();
 #endif // !_WIN32
+}
 
 Future<Unit> EdenServer::performNormalShutdown() {
 #ifndef _WIN32
   takeoverServer_.reset();
 
   // Clean up all the server mount points before shutting down the privhelper.
-  return unmountAll().thenTry([this](folly::Try<Unit>&& result) {
-    shutdownPrivhelper();
-    result.throwIfFailed();
-  });
+  // Return an uninitialized optional here to avoid an attempted recovery
+  return unmountAll();
 #else
   NOT_IMPLEMENTED();
 #endif // !_WIN32
@@ -948,8 +999,6 @@ void EdenServer::shutdownPrivhelper() {
           << privhelperExitCode;
     }
   }
-#else
-  NOT_IMPLEMENTED();
 #endif
 }
 
@@ -1448,6 +1497,7 @@ folly::Future<TakeoverData> EdenServer::startTakeoverShutdown() {
   // Make sure we aren't already shutting down, then update our state
   // to indicate that we should perform mount point takeover shutdown
   // once runServer() returns.
+  auto result = Future<TakeoverData>::makeEmpty();
   {
     auto state = runningState_.wlock();
     if (state->state != RunState::RUNNING) {
@@ -1478,14 +1528,15 @@ folly::Future<TakeoverData> EdenServer::startTakeoverShutdown() {
         "error duplicating thrift server socket during graceful takeover");
     state->takeoverThriftSocket =
         folly::File{takeoverThriftSocket, /* ownsFd */ true};
+    result = state->takeoverPromise.getFuture();
   }
 
   shutdownSubscribers();
 
-  // Stop the thrift server. We will fulfill takeoverPromise_ once it
+  // Stop the thrift server. We will fulfill takeoverPromise once it
   // stops.
   server_->stop();
-  return takeoverPromise_.getFuture();
+  return result;
 #else
   NOT_IMPLEMENTED();
 #endif // !_WIN32

@@ -150,6 +150,20 @@ class EdenServer : private TakeoverHandler {
       std::shared_ptr<StartupLogger> logger,
       bool waitForMountCompletion = true);
 
+#ifndef _WIN32
+  /**
+   * Recover the EdenServer after a failed takeover request.
+   *
+   * This function is very similar to prepare() implementation-wise,
+   * but uses a TakeoverData object from a failed takeover request
+   * to recover itself.
+   *
+   * This function resets the TakeoverServer, resets the takeoverPromise,
+   * sets takeoverShutdown to false, and sets the state to RUNNING
+   */
+  FOLLY_NODISCARD folly::Future<folly::Unit> recover(TakeoverData&& data);
+#endif // _WIN32
+
   /**
    * Shut down the EdenServer after it has stopped running.
    *
@@ -161,7 +175,12 @@ class EdenServer : private TakeoverHandler {
    * Otherwise performCleanup() will unmount and shutdown all currently running
    * mounts.
    */
-  void performCleanup();
+  bool performCleanup();
+
+  /**
+   * Close the backingStore and the localStore.
+   */
+  void closeStorage() override;
 
   /**
    * Stops this server, which includes the underlying Thrift server.
@@ -410,6 +429,13 @@ class EdenServer : private TakeoverHandler {
       std::shared_ptr<StartupLogger> logger);
   static void incrementStartupMountFailures();
 
+#ifndef _WIN32
+  /**
+   * recoverImpl() contains the bulk of the implementation of recover()
+   */
+  FOLLY_NODISCARD folly::Future<folly::Unit> recoverImpl(TakeoverData&& data);
+#endif // !_WIN32
+
   /**
    * Create config file if this the first time running the server, otherwise
    * parse existing config file.
@@ -435,8 +461,11 @@ class EdenServer : private TakeoverHandler {
       std::optional<TakeoverData::MountInfo> takeover);
 
   FOLLY_NODISCARD folly::Future<folly::Unit> performNormalShutdown();
-  FOLLY_NODISCARD folly::Future<folly::Unit> performTakeoverShutdown(
-      folly::File thriftSocket);
+  // If the takeover was successful, this returns std::nullopt. If a
+  // TakeoverData object is returned, that means the takeover attempt failed and
+  // the server should be resumed using the given TakeoverData.
+  FOLLY_NODISCARD folly::Future<std::optional<TakeoverData>>
+  performTakeoverShutdown(folly::File thriftSocket);
   void shutdownPrivhelper();
 
   // Starts up a new fuse mount for edenMount, starting up the thread
@@ -536,8 +565,6 @@ class EdenServer : private TakeoverHandler {
    * a graceful restart, taking over our running mount points.
    */
   std::unique_ptr<TakeoverServer> takeoverServer_;
-  folly::Promise<TakeoverData> takeoverPromise_;
-
 #endif // !_WIN32
 
   /**
@@ -548,6 +575,7 @@ class EdenServer : private TakeoverHandler {
     RunState state{RunState::STARTING};
     bool takeoverShutdown{false};
     folly::File takeoverThriftSocket;
+    folly::Promise<TakeoverData> takeoverPromise;
   };
   folly::Synchronized<RunStateData> runningState_;
 

@@ -32,6 +32,7 @@ namespace eden {
 
 TakeoverData takeoverMounts(
     AbsolutePathPiece socketPath,
+    bool shouldPing,
     const std::set<int32_t>& supportedVersions) {
   folly::EventBase evb;
   folly::Expected<UnixSocket::Message, folly::exception_wrapper>
@@ -56,16 +57,23 @@ TakeoverData takeoverMounts(
             auto timeout = std::chrono::seconds(FLAGS_takeoverReceiveTimeout);
             return socket.receive(timeout);
           })
-          .thenValue([&socket](UnixSocket::Message&& msg) {
+          .thenValue([&socket, shouldPing](UnixSocket::Message&& msg) {
             if (TakeoverData::isPing(&msg.data)) {
-              // Just send an empty message back here, the server knows it sent a
-              // ping so it does not need to parse the message.
-              UnixSocket::Message ping;
-              return socket.send(std::move(ping)).thenValue([&socket](auto&&) {
-                // Wait for the takeover data response
-                auto timeout = std::chrono::seconds(FLAGS_takeoverReceiveTimeout);
-                return socket.receive(timeout);
-              });
+              if (shouldPing) {
+                // Just send an empty message back here, the server knows it sent a
+                // ping so it does not need to parse the message.
+                UnixSocket::Message ping;
+                return socket.send(std::move(ping)).thenValue([&socket](auto&&) {
+                  // Wait for the takeover data response
+                  auto timeout = std::chrono::seconds(FLAGS_takeoverReceiveTimeout);
+                  return socket.receive(timeout);
+                });
+              } else {
+                // This should only be hit during integration tests.
+                return folly::makeFuture<UnixSocket::Message>(
+                    folly::exception_wrapper(std::runtime_error(
+                        "ping received but should not respond")));
+              }
             } else {
               // Older versions of EdenFS will not send a "ready" ping and
               // could simply send the takeover data.

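The client-side branching in the hunk above can be condensed into a small synchronous sketch. The types here are hypothetical: `FakeMessage` and `FakeSocket` are in-memory stand-ins, and treating an empty payload as the "ready" ping is an assumption made for the sketch rather than how `TakeoverData::isPing` is implemented. It covers the three cases the diff handles: a ping that is answered, a ping that is deliberately left unanswered (the test-only `shouldPing = false` path), and an older server that sends the data without pinging.

```cpp
#include <cassert>
#include <deque>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical message type; an empty payload plays the role of the ping.
struct FakeMessage {
  std::string payload;
  bool isPing() const {
    return payload.empty();
  }
};

// Hypothetical in-memory socket with a queue of scripted incoming messages.
struct FakeSocket {
  std::deque<FakeMessage> incoming;
  int pingRepliesSent = 0;

  FakeMessage receive() {
    if (incoming.empty()) {
      throw std::runtime_error("timed out waiting for a message");
    }
    FakeMessage msg = std::move(incoming.front());
    incoming.pop_front();
    return msg;
  }

  void sendPingReply() {
    ++pingRepliesSent;
  }
};

// Returns the takeover payload, answering the ping first when one arrives.
std::string receiveTakeoverPayload(FakeSocket& socket, bool shouldPing) {
  FakeMessage first = socket.receive();
  if (!first.isPing()) {
    // Older servers skip the ping and send the data straight away.
    return first.payload;
  }
  if (!shouldPing) {
    // Test-only path: leave the ping unanswered so the server has to recover.
    throw std::runtime_error("ping received but should not respond");
  }
  socket.sendPingReply();
  return socket.receive().payload;
}
```

Accepting data without a preceding ping keeps the sketch compatible with the older-server case that the diff's comment calls out.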
@@ -20,8 +20,9 @@ namespace eden {
  */
 TakeoverData takeoverMounts(
     AbsolutePathPiece socketPath,
-    // this parameter is present for testing purposes and should not normally
-    // be used in the production build.
+    // the following parameters are present for testing purposes and should not
+    // normally be used in the production build.
+    bool shouldPing = true,
     const std::set<int32_t>& supportedTakeoverVersions =
         kSupportedTakeoverVersions);
 

@@ -153,7 +153,7 @@ class TakeoverData {
    * The takeoverComplete promise will be fulfilled by the TakeoverServer code
    * once the TakeoverData has been sent to the remote process.
    */
-  folly::Promise<folly::Unit> takeoverComplete;
+  folly::Promise<std::optional<TakeoverData>> takeoverComplete;
 
  private:
   /**

@@ -38,6 +38,8 @@ class TakeoverHandler {
    * its mounts.
    */
   virtual folly::Future<TakeoverData> startTakeoverShutdown() = 0;
+
+  virtual void closeStorage() = 0;
 };
 
 } // namespace eden

@@ -208,10 +208,15 @@ Future<Unit> TakeoverServer::ConnHandler::pingThenSendTakeoverData(
               folly::Try<UnixSocket::Message>&& msg) mutable {
             if (msg.hasException()) {
               // If we got an exception on sending or receiving here, we should
-              // bubble up an exception and recover. It is important to mark the
-              // takeover as completed here so the promise fulfills and we
-              // continue the cleanup process inside of EdenServer
-              data.takeoverComplete.setException(msg.exception());
+              // bubble up an exception and recover.
+
+              // We must save the original takeoverComplete promise
+              // since we will move the TakeoverData into the takeoverComplete
+              // promise and the EdenServer waits on this to be fulfilled to
+              // determine to recover or not
+              auto takeoverPromise = std::move(data.takeoverComplete);
+              takeoverPromise.setValue(std::move(data));
+
               return makeFuture<Unit>(msg.exception());
             }
             return sendTakeoverData(std::move(data));
@@ -220,6 +225,11 @@ Future<Unit> TakeoverServer::ConnHandler::pingThenSendTakeoverData(
 
 Future<Unit> TakeoverServer::ConnHandler::sendTakeoverData(
     TakeoverData&& data) {
+  // Before sending the takeover data, we must close the server's
+  // local and backing store. This is important for ensuring the RocksDB
+  // lock is released so the client can take over.
+  server_->getTakeoverHandler()->closeStorage();
+
   UnixSocket::Message msg;
   try {
     msg.data = data.serialize(protocolVersion_);
@@ -240,7 +250,12 @@ Future<Unit> TakeoverServer::ConnHandler::sendTakeoverData(
   return socket_.send(std::move(msg))
       .thenTry([promise = std::move(data.takeoverComplete)](
                    folly::Try<Unit>&& sendResult) mutable {
-        promise.setTry(std::move(sendResult));
+        if (sendResult.hasException()) {
+          promise.setException(sendResult.exception());
+        } else {
+          // Set an uninitialized optional here to avoid an attempted recovery
+          promise.setValue(std::nullopt);
+        }
       });
 }
 

@@ -42,6 +42,8 @@ class TestHandler : public TakeoverHandler {
     return makeFuture(std::move(data_));
   }
 
+  void closeStorage() override {}
+
  private:
   TakeoverData data_;
 };
@@ -55,6 +57,7 @@ class ErrorHandler : public TakeoverHandler {
     return makeFuture<TakeoverData>(
         std::logic_error("purposely failing for testing"));
   }
+  void closeStorage() override {}
 };
 
 /**
@@ -70,7 +73,9 @@ Future<TakeoverData> takeoverViaEventBase(
   std::thread thread([path = AbsolutePath{socketPath},
                       supportedVersions,
                       promise = std::move(promise)]() mutable {
-    promise.setWith([&] { return takeoverMounts(path, supportedVersions); });
+    promise.setWith([&] {
+      return takeoverMounts(path, /*shouldPing=*/true, supportedVersions);
+    });
   });
 
   return future.via(evb).ensure(

@@ -22,6 +22,11 @@ DEFINE_string(edenDir, "", "The path to the .eden directory");
  */
 DEFINE_int32(takeoverVersion, 0, "The takeover version number to send");
 
+DEFINE_bool(
+    shouldPing,
+    true,
+    "This is used by integration tests to avoid sending a ping");
+
 FOLLY_INIT_LOGGING_CONFIG("eden=DBG2");
 
 using namespace facebook::eden::path_literals;
@@ -47,10 +52,11 @@ int main(int argc, char* argv[]) {
 
   facebook::eden::TakeoverData data;
   if (FLAGS_takeoverVersion == 0) {
-    data = facebook::eden::takeoverMounts(takeoverSocketPath);
+    data = facebook::eden::takeoverMounts(takeoverSocketPath, FLAGS_shouldPing);
   } else {
     auto takeoverVersion = std::set<int32_t>{FLAGS_takeoverVersion};
-    data = facebook::eden::takeoverMounts(takeoverSocketPath, takeoverVersion);
+    data = facebook::eden::takeoverMounts(
+        takeoverSocketPath, FLAGS_shouldPing, takeoverVersion);
   }
   for (const auto& mount : data.mountPoints) {
     XLOG(INFO) << "mount " << mount.mountPath << ": fd=" << mount.fuseFD.fd();

@@ -191,6 +191,17 @@ class EdenFS(object):
         cmd.extend(args)
         return cmd
 
+    def wait_for_is_healthy(self, timeout: float = 30) -> bool:
+        process = self._process
+        assert process is not None
+        health = util.wait_for_daemon_healthy(
+            proc=process,
+            config_dir=self._eden_dir,
+            get_client=lambda: self.get_thrift_client(),
+            timeout=timeout,
+        )
+        return health.is_healthy()
+
     def start(
         self,
         timeout: float = 60,
@@ -407,6 +418,27 @@ class EdenFS(object):
         ]
         self.run_takeover_tool(cmd)
 
+    def takeover_without_ping_response(self) -> None:
+        """
+        Execute a fake takeover to explicitly test a failed takeover. The
+        takeover client does not send a ping with the noshouldPing flag,
+        so the subprocess call will throw, and we expect the old process
+        to recover
+        """
+        # pyre-ignore[9]: T38947910
+        cmd: List[str] = [
+            FindExe.TAKEOVER_TOOL,
+            "--edenDir",
+            str(self._eden_dir),
+            "--noshouldPing",
+        ]
+
+        try:
+            subprocess.check_call(cmd)
+        except Exception:
+            # We expect the new process to fail starting.
+            pass
+
     def add_repository(self, name: str, repo_path: str) -> None:
         """
         Run "eden repository" to define a repository configuration

@@ -304,6 +304,12 @@ class TakeoverTest(testcase.EdenRepoTest):
         """
         self.eden.fake_takeover_with_version(3)
 
+    def test_takeover_failure(self) -> None:
+        print("=== beginning restart ===", file=sys.stderr)
+        self.eden.takeover_without_ping_response()
+        print("=== restart complete ===", file=sys.stderr)
+        self.assertTrue(self.eden.wait_for_is_healthy())
+
 
 @testcase.eden_repo_test
 class TakeoverRocksDBStressTest(testcase.EdenRepoTest):