marian/VERSION
Marcin Junczys-Dowmunt 7d2045a907 Merged PR 25686: Loading checkpoints from main node only via MPI
Enables loading of model checkpoints from main node only via MPI.

Until now the checkpoint needed to present in the same location on all nodes. That could be done either via writing to a shared filesystem (problematic due to bad syncing) or by manual copying to the same local location, e.g. /tmp on each node (while writing only happened to one main location).

Now, marian can resume training from only one location on the main node. The remaining nodes do not need to have access. E.g. local /tmp on the main node can be used, or race conditons on shared storage are avoided.

Also avoids creating files for logging on more than one node. This is a bit wonky, done via environment variable lookup.
2022-09-21 20:39:54 +00:00

2 lines
8 B
Plaintext