Improved handling of SIGTERM (#660)

* Return exit code 15 (SIGTERM) after SIGTERM.
When marian receives signal SIGTERM and exits gracefully (save model & exit),
it should then exit with a non-zero exit code, to signal to any parent process
that it did not exit "naturally".
* Added explanatory comment about exiting marian_train with non-zero status after SIGTERM.
* Bug fix: better handling of SIGTERM for graceful shutdown during training.
Prior to this bug fix, BatchGenerator::fetchBatches, which runs in a separate
thread, would ignore SIGTERM during training (training uses a custom signal handler
for SIGTERM, which simply sets a global flag, to enable graceful shutdown (i.e.,
save models and current state of training before shutting down).

The changes in this commit also facilitate custom handling of other signals in the
future by providing a general singal handler for all signals with a signal number
below 32 (setSignalFlag) and a generic flag checking function (getSignalFlag(sig))
for checking such flags.
This commit is contained in:
Ulrich Germann 2020-09-01 05:13:41 +01:00 committed by GitHub
parent 96738fea25
commit 044b416af5
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
10 changed files with 144 additions and 63 deletions

View File

@ -39,6 +39,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Properly record cmake variables in the cmake build directory instead of the source tree.
- Added default "none" for option shuffle in BatchGenerator, so that it works in executables where shuffle is not an option.
- Added a few missing header files in shortlist.h and beam_search.h.
- Improved handling for receiving SIGTERM during training. By default, SIGTERM triggers 'save (now) and exit'. Prior to this fix, batch pre-fetching did not check for this sigal, potentially delaying exit considerably. It now pays attention to that. Also, the default behaviour of save-and-exit can now be disabled on the command line with --sigterm exit-immediately.
### Changed
- Move Simple-WebSocket-Server to submodule

View File

@ -25,6 +25,7 @@ add_library(marian STATIC
common/filesystem.cpp
common/file_stream.cpp
common/file_utils.cpp
common/signal_handling.cpp
common/types.cpp
data/alignment.cpp
@ -99,7 +100,6 @@ add_library(marian STATIC
training/graph_group_singleton.cpp
training/validator.cpp
training/communicator.cpp
training/scheduler.cpp
# this is only compiled to catch build errors, but not linked
microsoft/quicksand.cpp

View File

@ -1,6 +1,7 @@
#include <signal.h>
#include "marian.h"
#include "common/signal_handling.h"
#include "training/graph_group_async.h"
#include "training/graph_group_singleton.h"
#include "training/graph_group_sync.h"
@ -42,14 +43,12 @@ int main(int argc, char** argv) {
New<Train<AsyncGraphGroup>>(options)->run();
}
}
// If we exit due to SIGTERM, exit with 128 + the signal number, as suggested
// for bash in http://tldp.org/LDP/abs/html/exitcodes.html. This allows parent
// If we exit due to a graceful exit request via SIGTERM, exit with 128 + SIGTERM,
// as suggested for bash in http://tldp.org/LDP/abs/html/exitcodes.html. This allows parent
// scripts to determine if training terminated naturally or via SIGTERM.
// Whith this approach we can accommodate additional signals in the future.
// An alternative would be to return 124, which is what the timeout command
// An alternative would be to exit with code 124, which is what the timeout command
// returns for timeout -s SIGTERM <seconds> ...., because exiting after SIGTERM
// is not technically a fatal error (which is what the 128+x convention usually
// stands for).
return getSigtermFlag() ? (128 + SIGTERM) : 0;
exit(getSignalFlag(SIGTERM) ? 128 + SIGTERM : EXIT_SUCCESS);
}

View File

@ -143,6 +143,15 @@ void ConfigParser::addOptionsGeneral(cli::CLIWrapper& cli) {
cli.add<std::string>("--dump-config",
"Dump current (modified) configuration to stdout and exit. Possible values: full, minimal, expand")
->implicit_val("full");
if(mode_ == cli::mode::training) {
// --sigterm is deliberately not a boolean, to allow for a consistent
// pattern of specifying custom signal handling in the future.
// (e.g., dump model but continue training upon SIGUSR1, or report current
// training status upon SIGINFO.)
cli.add<std::string>("--sigterm",
"What to do with SIGTERM: save-and-exit or exit-immediately.",
"save-and-exit");
}
// clang-format on
}

View File

@ -0,0 +1,58 @@
#include "common/logging.h"
#include "signal_handling.h"
// The simplest (and recommended) way to handle signals is to simply set a flag
// in the signal handler and check that flag later.
//
// We provide setSignalFlag as the most generic signal handler. This handler uses a
// single sig_atomic_t as a bit field. On Linux, sig_atomic_t is equivalent to a signed int,
// theoretically providing 32 binary flags; in practice, most likely signals for which we may
// want to install signal handlers are
// - SIGTERM (15): which by default signals the request for a graceful shutdown
// - SIGUSR1 (10): intended for custom use, default action in Linux is termination
// - SIGUSR2 (12): intended for custom use, default action in Linux is termination
// - SIGINT (2): interrupt from the console
// Just to be safe, we accommodate signals up to signal No. 30.
// In addition, we also provide requestSaveAndExit() and saveAndExit() as a signal
// handler/checker for graceful shutdown requests during training.
constexpr int maxSignalForSetSignalFlag{30};
// Make sure sig_atomic_t is large enough as a bit field for our purposes.
// That said, I'm not aware of any platform where this would be a problem.
static_assert(SIG_ATOMIC_MAX > (1U<<maxSignalForSetSignalFlag),
"sig_atomic_type is too small for signal flags on this platform.");
namespace marian{
volatile std::sig_atomic_t sigflags_{0};
volatile std::sig_atomic_t saveAndExit_{0};
void setSignalFlag(int sig) {
// sigflags_ is an int type serving as a bit filed for flags corresponding
// to signals (lower or equeal to maxSignalForSetSignalFlag). We set the
// flag by a binary or (|=) of the bit field and an int value with exactly
// one bit set (s^sig).
sigflags_ |= (1<<sig);
}
// Check if the flag for the signal sig is set in the bit field sigflags_
bool getSignalFlag(const int sig) {
ABORT_IF(sig > maxSignalForSetSignalFlag,
"Signal out of range (must be < {}, is {}).", maxSignalForSetSignalFlag, sig);
// Do bitwise AND between sigflags_ and an int value that has exactly one bit set that
// corresponds to the signal in question. If the bit is set (see setSignalFlag above),
// the bitwise AND will return a non-zero integer, if it is not set, the result will
// be zero.
return (sigflags_ & (1<<sig)) != 0;
}
void requestSaveAndExit(int sig) {
setSignalFlag(sig); // keep track of triggering signal
saveAndExit_ = 1; // set flag to exit gracefully
}
bool saveAndExitRequested() {
return saveAndExit_ == 1;
}
}

View File

@ -0,0 +1,39 @@
#pragma once
#include <csignal>
#include <string>
// SIGNAL HANDLING
// The signal handlers (and checkers) here are implemented in line with with the recommendations
// for signal handling in the SEI CERT C Coding Standard, specifically
//
// - SIG30-C:
// https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers
//
// - SIG31-C:
// https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers
//
// The exact behavior of 'graceful exit' depends on the application; for training, it means 'save model and exit',
// for a server (not implemented yet): 'block new requests but serve pending requests and then exit'.
//
// Graceful exit for training is useful for training on clusters with time limits on jobs. Slurm, for example, can be
// set up to send a custom signal at a set time before the end of the time slot, giving Marian time to save its current
// state before getting killed.
namespace marian {
/// Request graceful exit (signal handler)
void requestSaveAndExit(int sig);
/// Check if graceful exit was requested.
bool saveAndExitRequested();
/// General purpose signal handler that simply sets a flag when a signal is received.
// (only for SIGNAL No. < 32).
void setSignalFlag(int sig); // custom handler (set flag) for sig
/// Check if a setSignalFlag was triggered for this signal
bool getSignalFlag(int sig);
} // End of namespace marian

View File

@ -1,6 +1,7 @@
#pragma once
#include "common/options.h"
#include "common/signal_handling.h"
#include "data/batch_stats.h"
#include "data/rng_engine.h"
#include "training/training_state.h"
@ -136,6 +137,8 @@ private:
}
size_t sets = 0;
while(current_ != data_->end() && maxiBatch->size() < maxSize) { // loop over data
if (saveAndExitRequested()) // stop generating batches
return std::deque<BatchPtr>();
maxiBatch->push(*current_);
sets = current_->size();
// do not consume more than required for the maxi batch as this causes
@ -161,6 +164,8 @@ private:
if (stats_)
cachedStatsIter = stats_->begin();
while(!maxiBatch->empty()) { // while there are sentences in the queue
if (saveAndExitRequested()) // stop generating batches
return std::deque<BatchPtr>();
// push item onto batch
batchVector.push_back(maxiBatch->top());
maxiBatch->pop(); // fetch next-shortest
@ -249,7 +254,7 @@ private:
"If you have changed the training corpus, add --no-restore-corpus to the training command and run it again.");
bufferedBatches_ = std::move(futureBufferedBatches_.get());
// if bg thread returns an empty swath, we hit the end of the epoch
if (bufferedBatches_.empty()) {
if (bufferedBatches_.empty() || saveAndExitRequested()) {
return nullptr;
}
// and kick off the next bg operation
@ -257,7 +262,7 @@ private:
} else { // don't spawn any threads, i.e. batch fetching is blocking.
bufferedBatches_ = fetchBatches();
// if bufferedBatches is empty we hit the end of the epoch
if (bufferedBatches_.empty()) {
if (bufferedBatches_.empty() || saveAndExitRequested()) {
return nullptr;
}
}

View File

@ -1,43 +0,0 @@
#include "scheduler.h"
#include <signal.h>
#include <cassert>
namespace marian {
// SIGNAL HANDLING, see scheduler.cpp for definitions
// Currently, only the following is handled by a custom signal handler:
// SIGTERM: When SIGTERM is received, the global (static member) flag sigterm_ (false by default) is set to true
// by signalHandler(). When sigterm_ is true, keepGoing() returns false, and the current state of training models
// is saved prior to exiting.
// This functionality is helpful when training on clusters with time limits on compute slots, e.g., on s
// clusters managed by slurm. Slurm can be asked to sending a (custom) warning signal to a process at a given
// point in time prior to the hard "time's up".
bool sigterm_{false}; // flag signalling that SIGTERM has been received false by default, set to true by signalHandler(SIGTERM)
void signalHandler(int sig) {
// Note: sys_siglist[sig] or stdsignal() describe the effect (e.g.,
// 'Terminated' rather than provide the signal name (which are #define(s)
// in signal.h), so we have to do custom log messages here.
switch (sig) {
case SIGTERM: // save models and exit
LOG(info, "[training] Scheduler received signal SIGTERM"); // @TODO: figure out if this is safe. The logs are global and thread-safe, so should be OK?
sigterm_ = true;
break;
default:
ABORT("No action defined for signal {}", sig);
}
}
// installs signalHandler() for select signals (currently only SIGTERM)
void installSignalHandlers() {
// TODO: use sigaction instead of signal,
// cf. https://stackoverflow.com/questions/231912/what-is-the-difference-between-sigaction-and-signal
signal(SIGTERM, signalHandler);
}
bool getSigtermFlag() {
return sigterm_;
}
}

View File

@ -1,6 +1,7 @@
#pragma once
#include "common/options.h"
#include "common/signal_handling.h"
#include "training/training_state.h"
#include "training/validator.h"
#include "training/communicator.h"
@ -8,9 +9,6 @@
namespace marian {
bool getSigtermFlag();
void installSignalHandlers();
class Scheduler : public TrainingObserver {
private:
Ptr<Options> options_;
@ -154,11 +152,10 @@ public:
: options_(options), state_(state) {
ABORT_IF(state_->factor != 1, "state.factor unexpectedly not 1 at this point??");
updateLearningRate(*state);
installSignalHandlers();
}
bool keepGoing() {
if(getSigtermFlag()) // received signal SIGERM => exit gracefully
if(saveAndExitRequested()) // via SIGTERM
return false;
// stop if it reached the maximum number of epochs
@ -192,13 +189,12 @@ public:
void started() { LOG(info, "Training started"); }
void finished() {
if (getSigtermFlag())
LOG(info, "Training interrupted (SIGTERM).");
if (saveAndExitRequested())
LOG(info, "Training interrupted (via signal).");
else
LOG(info, "Training finished");
}
void addValidator(Ptr<ValidatorBase> validator) {
validators_.push_back(validator);
@ -223,9 +219,10 @@ public:
void validate(const std::vector<Ptr<ExpressionGraph>>& graphs,
bool isFinal = false) {
// Do not validate if already validated (for instance, after the model is
// loaded) or if validation is scheduled for another update, or when signal SIGTERM was received
if(getSigtermFlag() // SIGTERM was received
// Do not validate if already validated (for instance, after the model is loaded)
// or if validation is scheduled for another update, or when a graceful shutdown
// was requested.
if(saveAndExitRequested()
|| state_->validated // already validated (in resumed training, for example)
|| (!state_->enteredNewPeriodOf(options_->get<std::string>("valid-freq")) && !isFinal)) // not now
return;

View File

@ -16,6 +16,7 @@ template <class ModelWrapper>
class Train : public ModelTask {
private:
Ptr<Options> options_;
void installCustomSignalHandlers();
public:
Train(Ptr<Options> options) : options_(options) {}
@ -77,6 +78,9 @@ public:
bool restored = !options_->get<bool>("no-restore-corpus")
&& batchGenerator->restore(trainState);
// We only want custom behavior once training starts.
installCustomSignalHandlers();
// -- main training loop
scheduler->started();
while(scheduler->keepGoing()) {
@ -107,4 +111,16 @@ public:
finalizeMPI(std::move(mpi));
}
};
template <class ModelWrapper>
void Train<ModelWrapper>::installCustomSignalHandlers(){
const std::string sigTermAction = options_->get<std::string>("sigterm");
if (sigTermAction == "save-and-exit") {
LOG(debug, "Will save before exiting upon SIGTERM.");
signal(SIGTERM, requestSaveAndExit);
}
else if (sigTermAction != "exit-immediately")
ABORT("Unrecognized value '{}' for --sigterm", sigTermAction);
}
} // namespace marian