From c0cb356417915526b8ff40a2388ea32c83c34ab8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=B6rg=20Tiedemann?=
Date: Sat, 12 Sep 2020 12:01:02 +0300
Subject: [PATCH] added acknowledgements
---
 Makefile     |  3 +-
 README.md    | 89 +++++++++++++++++++++-------------------------------
 doc/Setup.md | 24 ++++++++++++--
 lib/env.mk   |  1 +
 4 files changed, 60 insertions(+), 57 deletions(-)
diff --git a/Makefile b/Makefile
index c3f9d158..df8a89b4 100644
--- a/Makefile
+++ b/Makefile
@@ -175,7 +175,8 @@ all: ${WORKDIR}/config.mk
 	${MAKE} eval
 	${MAKE} compare
-
+.PHONY: install
+install: install-prerequisites
 #---------------------------------------------------------------------
diff --git a/README.md b/README.md
index 3e64a9df..afe07baa 100644
--- a/README.md
+++ b/README.md
@@ -8,23 +8,25 @@ This package includes scripts for training NMT models using MarianNMT and OPUS d
 The subdirectory [models](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models) contains information about pre-trained models that can be downloaded from this project. They are distribted with a [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/) license.
-## Prerequisites
+## Quickstart
-Running the scripts does not work out of the box because many settings are adjusted for the local installations on our IT infrastructure at [CSC](https://docs.csc.fi/). Here is an incomplete list of prerequisites needed for running a process. It is on our TODO list to make the training procedures and settings more transparent and self-contained. Preliminary information about [installation and setup is available here](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/Setup.md).
+Setting up:
-* [marian-nmt](https://github.com/marian-nmt/): The essential NMT toolkit we use in OPUS-MT; make sure you compile a version with GPU and SentencePiece support!
-* [Moses scripts](https://github.com/moses-smt/mosesdecoder): various pre- and post-processing scripts from the Moses SMT toolkit (also bundled here: [marian-nmt](https://github.com/marian-nmt/moses-scripts))
-* [OpusTools](https://pypi.org/project/opustools): library and tools for accessing OPUS data
-* [OpusTools-perl](https://github.com/Helsinki-NLP/OpusTools-perl): additional tools for accessing OPUS data
-* [iso-639](https://pypi.org/project/iso-639/): a Python package for ISO 639 language codes
-* Perl modules [ISO::639::3](https://metacpan.org/pod/ISO::639::3) and [ISO::639::5](https://metacpan.org/pod/ISO::639::5)
-* [jq JSON processor](https://stedolan.github.io/jq/)
+```
+git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
+git submodule update --init --recursive --remote
+make install
+```
-Optional (recommended) software:
+Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):
-* [terashuf](https://github.com/alexandres/terashuf): efficiently shuffle massive data sets
-* [pigz](https://zlib.net/pigz/): multithreaded gzip
-* [efmomal](https://github.com/robertostling/eflomal) (needed for word alignment when transformer-align is used)
+```
+make SRCLANGS="fi et" TRGLANGS="da sv en" train
+make SRCLANGS="fi et" TRGLANGS="da sv en" eval
+make SRCLANGS="fi et" TRGLANGS="da sv en" release
+```
+
+More information is available in the documentation linked below.
 ## Documentation
@@ -37,50 +39,29 @@ Optional (recommended) software:
 * [How to train models for the Tatoeba MT Challenge](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/TatoebaChallenge.md)
-## Structure of the training scripts
+## References
-Essential files for making new models:
-
-* `Makefile`: top-level makefile
-* `lib/env.mk`: system-specific environment (now based on CSC machines)
-* `lib/config.mk`: essential model configuration
-* `lib/data.mk`: data pre-processing tasks
-* `lib/generic.mk`: generic implicit rules that can extend other tasks
-* `lib/dist.mk`: make packages for distributing models (CSC ObjectStorage based)
-* `lib/slurm.mk`: submit jobs with SLURM
-
-There are also make targets for specific projects and tasks. Look into `lib/projects/` to see what has been defined already.
-Note that this frequently changes! Check the file `lib/projects.mk` to see what kind of project files are enabled. There are currently, for example:
-
-* `lib/projects/multilingual.mk`: various multilingual models
-* `lib/projects/celtic.mk`: data and models for Celtic languages
-* `lib/projects/doclevel.mk`: experimental document-level models
-
-
-Run this if you want to train a model, for example for translating English to French:
+Please cite the following paper if you use OPUS-MT software and models:
 ```
-make SRCLANGS=en TRGLANGS=fr train
-```
-
-To evaluate the model with the automatically generated test data (from the Tatoeba corpus as a default) run:
-
-```
-make SRCLANGS=en TRGLANGS=fr eval
-```
-
-For multilingual (more than one language on either side) models run, for example:
-
-```
-make SRCLANGS="de en" TRGLANGS="fr es pt" train
-make SRCLANGS="de en" TRGLANGS="fr es pt" eval
-```
-
-Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed up things you can process data sets in parallel using the jobs flag of make, for example using 8 threads:
-
-```
-make -j 8 SRCLANG=en TRGLANG=fr data
-```
+@InProceedings{TiedemannThottingal:EAMT2020,
+  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
+  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
+  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
+  year = {2020},
+  address = {Lisbon, Portugal}
+ }
+ ```
+## Acknowledgements
+None of this would be possible without all the great open source software including
+
+* GNU/Linux tools
+* [Marian-NMT](https://github.com/marian-nmt/)
+* [eflomal](https://github.com/robertostling/eflomal)
+
+... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...
+
+We would also like to acknowledge the support by the [University of Helsinki](https://blogs.helsinki.fi/language-technology/), the [IT Center of Science CSC](https://www.csc.fi/en/home), the funding through projects in the EU Horizon 2020 framework ([FoTran](http://www.helsinki.fi/fotran), [MeMAD](https://memad.eu/), [ELG](https://www.european-language-grid.eu/)) and the contributors to the open collection of parallel corpora [OPUS](http://opus.nlpl.eu/).
diff --git a/doc/Setup.md b/doc/Setup.md
index 120b3575..430009e2 100644
--- a/doc/Setup.md
+++ b/doc/Setup.md
@@ -1,6 +1,5 @@
 # Installation and setup
-
 * download the code
 ```
@@ -12,10 +11,31 @@ git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
 ```
 git submodule update --init --recursive --remote
-make install-prerequisites
+make install
 ```
+## Prerequisites
+
+The installation procedure should hopefully set up the necessary software for running the OPUS-MT recipes. Be aware that running the scripts does not work out of the box because many settings are adjusted for the local installations on our IT infrastructure at [CSC](https://docs.csc.fi/). Here is an incomplete list of prerequisites needed for running a process:
+
+* [marian-nmt](https://github.com/marian-nmt/): The essential NMT toolkit we use in OPUS-MT; make sure you compile a version with GPU and SentencePiece support!
+* [Moses scripts](https://github.com/moses-smt/mosesdecoder): various pre- and post-processing scripts from the Moses SMT toolkit (also bundled here: [marian-nmt](https://github.com/marian-nmt/moses-scripts))
+* [OpusTools](https://pypi.org/project/opustools): library and tools for accessing OPUS data
+* [OpusTools-perl](https://github.com/Helsinki-NLP/OpusTools-perl): additional tools for accessing OPUS data
+* [iso-639](https://pypi.org/project/iso-639/): a Python package for ISO 639 language codes
+* Perl modules [ISO::639::3](https://metacpan.org/pod/ISO::639::3) and [ISO::639::5](https://metacpan.org/pod/ISO::639::5)
+* [jq JSON processor](https://stedolan.github.io/jq/)
+
+Optional (recommended) software:
+
+* [terashuf](https://github.com/alexandres/terashuf): efficiently shuffle massive data sets
+* [pigz](https://zlib.net/pigz/): multithreaded gzip
+* [eflomal](https://github.com/robertostling/eflomal) (needed for word alignment when transformer-align is used)
+* [fast_align](https://github.com/clab/fast_align)
+
+
+
 ## Mac OSX
 * for Marian-NMT: make sure that you have Xcode, protobuf and MKL installed. Protobuf can be added using, for example Mac ports:
diff --git a/lib/env.mk b/lib/env.mk
index e68cb7e3..9957a7b5 100644
--- a/lib/env.mk
+++ b/lib/env.mk
@@ -224,6 +224,7 @@ install-prerequisites install-prereq install-requirements:
 	${MAKE} install-perl-modules:
 	${MAKE} ${PREREQ_TOOLS}
+.PHONY: install-perl-modules
 install-perl-modules:
 	for p in ${PREREQ_PERL}; do \
 	  perl -e "use $$p;" || ${CPAN} -i $$p; \