marian/src/examples
Marcin Junczys-Dowmunt 7d2045a907 Merged PR 25686: Loading checkpoints from main node only via MPI
Enables loading of model checkpoints from main node only via MPI.

Until now, the checkpoint needed to be present in the same location on all nodes. That could be done either by writing to a shared filesystem (problematic due to bad syncing) or by manually copying it to the same local location on each node, e.g. /tmp (while writing only happened to one main location).

Now, marian can resume training from only one location on the main node. The remaining nodes do not need to have access. For example, the local /tmp on the main node can be used, and race conditions on shared storage are avoided.

Also avoids creating files for logging on more than one node. This is a bit wonky, done via environment variable lookup.
2022-09-21 20:39:54 +00:00
iris Merged PR 17337: fp16 support for training 2021-01-28 16:15:44 +00:00
mnist Merged PR 25686: Loading checkpoints from main node only via MPI 2022-09-21 20:39:54 +00:00
CMakeLists.txt make compile with CUDNN and static libs, suppress CMAKE dev warning from 3rd party tools 2019-07-04 16:26:32 -07:00
README.md Organize examples 2017-06-04 15:07:24 +02:00

Marian examples

Examples are enabled with CMake option -DCOMPILE_EXAMPLES=ON.
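
As a minimal sketch of how the option might be passed, assuming an out-of-source build run from the repository root (the `build` directory name and the plain `make` step are assumptions, not taken from this README):

```shell
# Flag taken from this README; the build directory name is an assumption.
CMAKE_ARGS="-DCOMPILE_EXAMPLES=ON"
BUILD_DIR="${BUILD_DIR:-build}"

if [ -f CMakeLists.txt ] && [ -d src/examples ]; then
  # Configure and build out of source, from the repository root.
  mkdir -p "$BUILD_DIR"
  (cd "$BUILD_DIR" && cmake .. "$CMAKE_ARGS" && make)
else
  # Outside the repository, only show what would be run.
  echo "Not in the Marian repository root; would run: cmake .. $CMAKE_ARGS"
fi
```

Any other CMake options you normally use can be added alongside `-DCOMPILE_EXAMPLES=ON`.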

MNIST

You will need MNIST data for training and testing. Download them with the script src/examples/mnist/download.sh or provide paths to the files with --train-sets and --valid-sets options.
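
The two alternatives above could look roughly like the following sketch. The binary name `mnist_example`, its location under `build`, the data directory, and the pairing of an image file with a label file per option are all assumptions for illustration; check your build output and the example's `--help` for the actual values. The file names are the standard ones from the MNIST distribution.

```shell
# Hypothetical locations; adjust to your checkout and build.
DATA_DIR="${DATA_DIR:-src/examples/mnist}"
MNIST_BIN="${MNIST_BIN:-./build/mnist_example}"   # assumed binary name

# Alternative 1: fetch the data with the script named in this README.
# Where the script stores the files is not stated here.
if [ -f src/examples/mnist/download.sh ]; then
  bash src/examples/mnist/download.sh
fi

# Alternative 2: point the example at data files explicitly.
if [ -x "$MNIST_BIN" ]; then
  "$MNIST_BIN" \
    --train-sets "$DATA_DIR/train-images-idx3-ubyte" "$DATA_DIR/train-labels-idx1-ubyte" \
    --valid-sets "$DATA_DIR/t10k-images-idx3-ubyte" "$DATA_DIR/t10k-labels-idx1-ubyte"
else
  echo "MNIST example binary not found at $MNIST_BIN; build with -DCOMPILE_EXAMPLES=ON first."
fi
```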