initial import

Joerg Tiedemann 2020-01-10 16:45:42 +02:00
commit b36d9a3e22
2914 changed files with 21790 additions and 0 deletions

LICENSE (new file, 396 lines)
Attribution 4.0 International
=======================================================================
Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution 4.0 International Public License ("Public License"). To the
extent this Public License may be interpreted as a contract, You are
granted the Licensed Rights in consideration of Your acceptance of
these terms and conditions, and the Licensor grants You such rights in
consideration of benefits the Licensor receives from making the
Licensed Material available under these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material
and in which the Licensed Material is translated, altered,
arranged, transformed, or otherwise modified in a manner requiring
permission under the Copyright and Similar Rights held by the
Licensor. For purposes of this Public License, where the Licensed
Material is a musical work, performance, or sound recording,
Adapted Material is always produced where the Licensed Material is
synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright
and Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
d. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright
Treaty adopted on December 20, 1996, and/or similar international
agreements.
e. Exceptions and Limitations means fair use, fair dealing, and/or
any other exception or limitation to Copyright and Similar Rights
that applies to Your use of the Licensed Material.
f. Licensed Material means the artistic or literary work, database,
or other material to which the Licensor applied this Public
License.
g. Licensed Rights means the rights granted to You subject to the
terms and conditions of this Public License, which are limited to
all Copyright and Similar Rights that apply to Your use of the
Licensed Material and that the Licensor has authority to license.
h. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
i. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such
as reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the
public may access the material from a place and at a time
individually chosen by them.
j. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially
equivalent rights anywhere in the world.
k. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part; and
b. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
4. If You Share Adapted Material You produce, the Adapter's
License You apply must not prevent recipients of the Adapted
Material from complying with this Public License.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
to extract, reuse, reproduce, and Share all or a substantial
portion of the contents of the database;
b. if You include all or a substantial portion of the database
contents in a database in which You have Sui Generis Database
Rights, then the database in which You have Sui Generis Database
Rights (but not its individual contents) is Adapted Material; and
c. You must comply with the conditions in Section 3(a) if You Share
all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent
possible, most closely approximates an absolute disclaimer and
waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the
Licensed Material under separate terms or conditions or stop
distributing the Licensed Material at any time; however, doing so
will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different
terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and
shall not be interpreted to, reduce, limit, restrict, or impose
conditions on any use of the Licensed Material that could lawfully
be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted
as a limitation upon, or waiver of, any privileges and immunities
that apply to the Licensor or You, including from the legal
processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the "Licensor." The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.
Creative Commons may be contacted at creativecommons.org.

Makefile (new file, 564 lines)
# -*-makefile-*-
#
# train Opus-MT models using MarianNMT
#
#--------------------------------------------------------------------
#
# (1) train NMT model
#
# make train .............. train NMT model for current language pair
#
# (2) translate and evaluate
#
# make translate .......... translate test set
# make eval ............... evaluate
#
#--------------------------------------------------------------------
#
# general parameters / variables (see Makefile.config)
# SRCLANGS ............ set source language(s) (en)
# TRGLANGS ............ set target language(s) (de)
#
#
# submit jobs by adding a suffix to the make target to be run
# .submit ........ job on GPU nodes (for train and translate)
# .submitcpu ..... job on CPU nodes (for translate and eval)
#
# for example:
# make train.submit
#
# run a multigpu job, for example
# make train-multigpu.submit
# make train-twogpu.submit
# make train-gpu01.submit
# make train-gpu23.submit
#
#
# typical procedure: train and evaluate en-de with 3 models in ensemble
#
# make data.submitcpu
# make vocab.submit
# make NR=1 train.submit
# make NR=2 train.submit
# make NR=3 train.submit
#
# make NR=1 eval.submit
# make NR=2 eval.submit
# make NR=3 eval.submit
# make eval-ensemble.submit
#
#
# include right-to-left models:
#
# make NR=1 train-RL.submit
# make NR=2 train-RL.submit
# make NR=3 train-RL.submit
#
#
#--------------------------------------------------------------------
# train several versions of the same model (for ensembling)
#
# make NR=1 ....
# make NR=2 ....
# make NR=3 ....
#
# DANGER: do not start them simultaneously!
# --> the processes race to create the shared vocabulary files
#
#--------------------------------------------------------------------
# resume training
#
# make resume
#
#--------------------------------------------------------------------
# translate with ensembles of models
#
# make translate-ensemble
# make eval-ensemble
#
# this only makes sense if there are several models
# (created with different NR)
#--------------------------------------------------------------------
# check and adjust Makefile.env and Makefile.config
# add specific tasks in Makefile.tasks
include Makefile.env
include Makefile.config
include Makefile.dist
include Makefile.tasks
include Makefile.data
include Makefile.doclevel
include Makefile.slurm
#------------------------------------------------------------------------
# make various data sets
#------------------------------------------------------------------------
.PHONY: data
data: ${TRAIN_SRC}.clean.${PRE_SRC}.gz ${TRAIN_TRG}.clean.${PRE_TRG}.gz \
${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
${MAKE} ${TEST_SRC}.${PRE_SRC} ${TEST_TRG}
${MAKE} ${TRAIN_ALG}
${MAKE} ${MODEL_VOCAB}
traindata: ${TRAIN_SRC}.clean.${PRE_SRC}.gz ${TRAIN_TRG}.clean.${PRE_TRG}.gz
tunedata: ${TUNE_SRC}.${PRE_SRC} ${TUNE_TRG}.${PRE_TRG}
devdata: ${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
testdata: ${TEST_SRC}.${PRE_SRC} ${TEST_TRG}
wordalign: ${TRAIN_ALG}
devdata-raw: ${DEV_SRC} ${DEV_TRG}
#------------------------------------------------------------------------
# train, translate and evaluate
#------------------------------------------------------------------------
## main targets
vocab: ${MODEL_VOCAB}
train: ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.done
translate: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}
eval: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}.eval
compare: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}.compare
## ensemble of models (assumes they can be found in subdirs of the WORKDIR)
translate-ensemble: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.ensemble.${SRC}.${TRG}
eval-ensemble: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.ensemble.${SRC}.${TRG}.eval
## resume training on an existing model
resume:
if [ -e ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.npz.best-perplexity.npz ]; then \
cp ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.npz.best-perplexity.npz \
${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.npz; \
fi
sleep 1
rm -f ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.done
${MAKE} train
#------------------------------------------------------------------------
# translate and evaluate all test sets in testsets/
#------------------------------------------------------------------------
## testset dir for all test sets in this language pair
## and all tokenized test sets that can be found in that directory
TESTSET_HOME = ${PWD}/testsets
TESTSET_DIR = ${TESTSET_HOME}/${SRC}-${TRG}
# TESTSETS = $(patsubst ${TESTSET_DIR}/%.${SRC}.tok.gz,%,${wildcard ${TESTSET_DIR}/*.${SRC}.tok.gz})
TESTSETS = $(patsubst ${TESTSET_DIR}/%.${SRC}.gz,%,${wildcard ${TESTSET_DIR}/*.${SRC}.gz})
TESTSETS_PRESRC = $(patsubst %.gz,%.${PRE}.gz,${sort $(subst .${PRE},,${wildcard ${TESTSET_DIR}/*.${SRC}.gz})})
TESTSETS_PRETRG = $(patsubst %.gz,%.${PRE}.gz,${sort $(subst .${PRE},,${wildcard ${TESTSET_DIR}/*.${TRG}.gz})})
## eval all available test sets
eval-testsets:
for s in ${SRCLANGS}; do \
for t in ${TRGLANGS}; do \
${MAKE} SRC=$$s TRG=$$t compare-testsets-langpair; \
done \
done
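## The nested loop above calls make once per source/target combination.
## A minimal shell sketch of the same expansion (the SRCLANGS/TRGLANGS
## values below are assumed examples; the real ones come from Makefile.config):

```shell
# Hypothetical language lists, as set in Makefile.config.
SRCLANGS="en de"
TRGLANGS="fi sv"
pairs=""
for s in $SRCLANGS; do
  for t in $TRGLANGS; do
    # the Makefile runs: make SRC=$s TRG=$t compare-testsets-langpair
    pairs="$pairs $s-$t"
  done
done
pairs=${pairs# }   # strip leading space
echo "$pairs"      # en-fi en-sv de-fi de-sv
```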
eval-heldout:
${MAKE} TESTSET_HOME=${HELDOUT_DIR} eval-testsets
%-testsets-langpair: ${TESTSETS_PRESRC} ${TESTSETS_PRETRG}
@echo "testsets: ${TESTSET_DIR}/*.${SRC}.gz"
for t in ${TESTSETS}; do \
${MAKE} TESTSET=$$t ${@:-testsets-langpair=}; \
done
#------------------------------------------------------------------------
# some helper functions
#------------------------------------------------------------------------
## check whether a model has converged or not
finished:
@if grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKDIR}/${MODEL_VALIDLOG}; then \
echo "${WORKDIR}/${MODEL_BASENAME} finished"; \
else \
echo "${WORKDIR}/${MODEL_BASENAME} unfinished"; \
fi
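## The check above relies only on the exit status of grep -q over the Marian
## validation log. A standalone sketch (the log line below is fabricated to
## mimic the early-stopping message; MARIAN_EARLY_STOPPING=10 is assumed):

```shell
# Fabricated validation-log content; real logs are written via --valid-log.
printf 'Ep. 12 : stalled 10 times (last best seen before)\n' > valid.log
if grep -q 'stalled 10 times' valid.log; then
  status=finished
else
  status=unfinished
fi
rm -f valid.log
echo "$status"   # finished
```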
## extension -all: run something over all language pairs, e.g.
## make wordalign-all
## this goes sequentially over all language pairs
## for the parallelizable version of this: look at %-all-parallel
%-all:
for l in ${ALL_LANG_PAIRS}; do \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" ${@:-all=}; \
done
# run something over all language pairs that have trained models
## - make eval-allmodels
## - make dist-allmodels
%-allmodels:
for l in ${ALL_LANG_PAIRS}; do \
if [ `find ${WORKHOME}/$$l -name '*.${PRE_SRC}-${PRE_TRG}.*.npz' | wc -l` -gt 0 ]; then \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" ${@:-allmodels=}; \
fi \
done
## only bilingual models
%-allbilingual:
for l in ${ALL_BILINGUAL_MODELS}; do \
if [ `find ${WORKHOME}/$$l -name '*.${PRE_SRC}-${PRE_TRG}.*.npz' | wc -l` -gt 0 ]; then \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" ${@:-allbilingual=}; \
fi \
done
## run something over all language pairs but make it possible to do it in parallel, for example
## - make dist-all-parallel
%-all-parallel:
${MAKE} $(subst -all-parallel,,${patsubst %,$@__%-run-for-langpair,${ALL_LANG_PAIRS}})
## run a command that includes the langpair, for example
## make wordalign__en-da+sv-run-for-langpair ...... runs wordalign with SRCLANGS="en" TRGLANGS="da sv"
## What is this good for?
## ---> many language pairs can run in parallel instead of looping over them sequentially
%-run-for-langpair:
${MAKE} SRCLANGS='$(subst +, ,$(firstword $(subst -, ,${lastword ${subst __, ,${@:-run-for-langpair=}}})))' \
TRGLANGS='$(subst +, ,$(lastword $(subst -, ,${lastword ${subst __, ,${@:-run-for-langpair=}}})))' \
${shell echo $@ | sed 's/__.*$$//'}
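## The make functions above split the target name on "__" and "-" to recover
## the goal and the language lists. A plain POSIX-shell sketch of the same
## decomposition (using the example target from the comments; this is an
## illustration, not what make itself executes):

```shell
# Example target name taken from the comment above.
target="wordalign__en-da+sv-run-for-langpair"
goal=${target%%__*}                      # wordalign
pair=${target#*__}
pair=${pair%-run-for-langpair}           # en-da+sv
src=${pair%%-*}                          # en
trg=${pair#*-}                           # da+sv
SRCLANGS=$(echo "$src" | tr '+' ' ')     # "en"
TRGLANGS=$(echo "$trg" | tr '+' ' ')     # "da sv"
echo "make SRCLANGS=\"$SRCLANGS\" TRGLANGS=\"$TRGLANGS\" $goal"
```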
## right-to-left model
%-RL:
${MAKE} MODEL=${MODEL}-RL \
MARIAN_EXTRA="${MARIAN_EXTRA} --right-left" \
${@:-RL=}
## run a multigpu job (2 or 4 GPUs)
%-multigpu %-gpu0123:
${MAKE} NR_GPUS=4 MARIAN_GPUS='0 1 2 3' $(subst -gpu0123,,${@:-multigpu=})
%-twogpu %-gpu01:
${MAKE} NR_GPUS=2 MARIAN_GPUS='0 1' $(subst -gpu01,,${@:-twogpu=})
%-gpu23:
${MAKE} NR_GPUS=2 MARIAN_GPUS='2 3' ${@:-gpu23=}
## run on CPUs (translate-cpu, eval-cpu, translate-ensemble-cpu, ...)
%-cpu:
${MAKE} MARIAN=${MARIANCPU} \
LOADMODS='${LOADCPU}' \
MARIAN_DECODER_FLAGS="${MARIAN_DECODER_CPU}" \
${@:-cpu=}
## document level models
%-doc:
${MAKE} WORKHOME=${shell realpath ${PWD}/work-spm} \
PRE=norm \
PRE_SRC=spm${SRCBPESIZE:000=}k.doc${CONTEXT_SIZE} \
PRE_TRG=spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE} \
${@:-doc=}
## sentence-piece models
%-spm:
${MAKE} WORKHOME=${shell realpath ${PWD}/work-spm} \
PRE=norm \
PRE_SRC=spm${SRCBPESIZE:000=}k \
PRE_TRG=spm${TRGBPESIZE:000=}k \
${@:-spm=}
%-spm-noalign:
${MAKE} WORKHOME=${shell realpath ${PWD}/work-spm-noalign} \
MODELTYPE=transformer \
PRE=norm \
PRE_SRC=spm${SRCBPESIZE:000=}k \
PRE_TRG=spm${TRGBPESIZE:000=}k \
${@:-spm-noalign=}
## BPE models
%-bpe:
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe} \
PRE=tok \
MODELTYPE=transformer \
PRE_SRC=bpe${SRCBPESIZE:000=}k \
PRE_TRG=bpe${TRGBPESIZE:000=}k \
${@:-bpe=}
%-bpe-align:
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe-align} \
PRE=tok \
PRE_SRC=bpe${SRCBPESIZE:000=}k \
PRE_TRG=bpe${TRGBPESIZE:000=}k \
${@:-bpe-align=}
%-bpe-memad:
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe-memad} \
PRE=tok \
MODELTYPE=transformer \
PRE_SRC=bpe${SRCBPESIZE:000=}k \
PRE_TRG=bpe${TRGBPESIZE:000=}k \
${@:-bpe-memad=}
%-bpe-old:
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe-old} \
PRE=tok \
MODELTYPE=transformer \
PRE_SRC=bpe${SRCBPESIZE:000=}k \
PRE_TRG=bpe${TRGBPESIZE:000=}k \
${@:-bpe-old=}
## for the inbuilt sentence-piece segmentation:
# PRE_SRC=txt PRE_TRG=txt
# MARIAN=${MARIAN}-spm
# MODEL_VOCABTYPE=spm
## continue document-level training with a new context size
ifndef NEW_CONTEXT
NEW_CONTEXT = $$(($(CONTEXT_SIZE) + $(CONTEXT_SIZE)))
endif
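## The default above doubles CONTEXT_SIZE using shell arithmetic ($$((...))
## in the Makefile escapes to $((...)) in the shell). A sketch with an
## assumed example value:

```shell
# CONTEXT_SIZE=4 is an assumed example value.
CONTEXT_SIZE=4
NEW_CONTEXT=$(( CONTEXT_SIZE + CONTEXT_SIZE ))
echo "$NEW_CONTEXT"   # 8
```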
continue-doctrain:
mkdir -p ${WORKDIR}/${MODEL}
cp ${MODEL_VOCAB} ${WORKDIR}/${MODEL}/$(subst .doc${CONTEXT_SIZE},.doc${NEW_CONTEXT},${notdir ${MODEL_VOCAB}})
cp ${MODEL_FINAL} ${WORKDIR}/${MODEL}/$(subst .doc${CONTEXT_SIZE},.doc${NEW_CONTEXT},$(notdir ${MODEL_BASENAME})).npz
${MAKE} MODEL_SUBDIR=${MODEL}/ CONTEXT_SIZE=$(NEW_CONTEXT) train-doc
## continue training with a new dataset
ifndef NEW_DATASET
NEW_DATASET = OpenSubtitles
endif
continue-datatrain:
mkdir -p ${WORKDIR}/${MODEL}
cp ${MODEL_VOCAB} ${WORKDIR}/${MODEL}/$(patsubst ${DATASET}%,${NEW_DATASET}%,${notdir ${MODEL_VOCAB}})
cp ${MODEL_FINAL} ${WORKDIR}/${MODEL}/$(patsubst ${DATASET}%,${NEW_DATASET}%,${MODEL_BASENAME}).npz
if [ -e ${BPESRCMODEL} ]; then \
cp ${BPESRCMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${BPESRCMODEL}); \
cp ${BPETRGMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${BPETRGMODEL}); \
fi
if [ -e ${SPMSRCMODEL} ]; then \
cp ${SPMSRCMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${SPMSRCMODEL}); \
cp ${SPMTRGMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${SPMTRGMODEL}); \
fi
${MAKE} MODEL_SUBDIR=${MODEL}/ DATASET=$(NEW_DATASET) train
# MARIAN_EXTRA="${MARIAN_EXTRA} --no-restore-corpus"
#------------------------------------------------------------------------
# training MarianNMT models
#------------------------------------------------------------------------
## make vocabulary
## - no new vocabulary is created if the file already exists!
## - need to delete the file if you want to create a new one!
${MODEL_VOCAB}: ${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz
ifeq ($(wildcard ${MODEL_VOCAB}),)
mkdir -p ${dir $@}
${LOADMODS} && zcat $^ | ${MARIAN}/marian-vocab --max-size ${VOCABSIZE} > $@
else
@echo "$@ already exists!"
@echo "WARNING! No new vocabulary is created even though the data has changed!"
@echo "WARNING! Delete the file if you want to start from scratch!"
touch $@
endif
## NEW: take away dependency on ${MODEL_VOCAB}
## (will be created by marian if it does not exist)
## train transformer model
${WORKDIR}/${MODEL}.transformer.model${NR}.done: \
${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz \
${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
mkdir -p ${dir $@}
${LOADMODS} && ${MARIAN}/marian ${MARIAN_EXTRA} \
--model $(@:.done=.npz) \
--type transformer \
--train-sets ${word 1,$^} ${word 2,$^} ${MARIAN_TRAIN_WEIGHTS} \
--max-length 500 \
--vocabs ${MODEL_VOCAB} ${MODEL_VOCAB} \
--mini-batch-fit \
-w ${MARIAN_WORKSPACE} \
--maxi-batch ${MARIAN_MAXI_BATCH} \
--early-stopping ${MARIAN_EARLY_STOPPING} \
--valid-freq ${MARIAN_VALID_FREQ} \
--save-freq ${MARIAN_SAVE_FREQ} \
--disp-freq ${MARIAN_DISP_FREQ} \
--valid-sets ${word 3,$^} ${word 4,$^} \
--valid-metrics perplexity \
--valid-mini-batch ${MARIAN_VALID_MINI_BATCH} \
--beam-size 12 --normalize 1 \
--log $(@:.model${NR}.done=.train${NR}.log) --valid-log $(@:.model${NR}.done=.valid${NR}.log) \
--enc-depth 6 --dec-depth 6 \
--transformer-heads 8 \
--transformer-postprocess-emb d \
--transformer-postprocess dan \
--transformer-dropout ${MARIAN_DROPOUT} \
--label-smoothing 0.1 \
--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--tied-embeddings-all \
--overwrite --keep-best \
--devices ${MARIAN_GPUS} \
--sync-sgd --seed ${SEED} \
--sqlite \
--tempdir ${TMPDIR} \
--exponential-smoothing
touch $@
## NEW: take away dependency on ${MODEL_VOCAB}
## train transformer model with guided alignment
${WORKDIR}/${MODEL}.transformer-align.model${NR}.done: \
${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz \
${TRAIN_ALG} \
${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
mkdir -p ${dir $@}
${LOADMODS} && ${MARIAN}/marian ${MARIAN_EXTRA} \
--model $(@:.done=.npz) \
--type transformer \
--train-sets ${word 1,$^} ${word 2,$^} ${MARIAN_TRAIN_WEIGHTS} \
--max-length 500 \
--vocabs ${MODEL_VOCAB} ${MODEL_VOCAB} \
--mini-batch-fit \
-w ${MARIAN_WORKSPACE} \
--maxi-batch ${MARIAN_MAXI_BATCH} \
--early-stopping ${MARIAN_EARLY_STOPPING} \
--valid-freq ${MARIAN_VALID_FREQ} \
--save-freq ${MARIAN_SAVE_FREQ} \
--disp-freq ${MARIAN_DISP_FREQ} \
--valid-sets ${word 4,$^} ${word 5,$^} \
--valid-metrics perplexity \
--valid-mini-batch ${MARIAN_VALID_MINI_BATCH} \
--beam-size 12 --normalize 1 \
--log $(@:.model${NR}.done=.train${NR}.log) --valid-log $(@:.model${NR}.done=.valid${NR}.log) \
--enc-depth 6 --dec-depth 6 \
--transformer-heads 8 \
--transformer-postprocess-emb d \
--transformer-postprocess dan \
--transformer-dropout ${MARIAN_DROPOUT} \
--label-smoothing 0.1 \
--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--tied-embeddings-all \
--overwrite --keep-best \
--devices ${MARIAN_GPUS} \
--sync-sgd --seed ${SEED} \
--sqlite \
--tempdir ${TMPDIR} \
--exponential-smoothing \
--guided-alignment ${word 3,$^}
touch $@
#------------------------------------------------------------------------
# translate with an ensemble of several models
#------------------------------------------------------------------------
ENSEMBLE = ${wildcard ${WORKDIR}/${MODEL}.${MODELTYPE}.model*.npz.best-perplexity.npz}
${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.ensemble.${SRC}.${TRG}: ${TEST_SRC}.${PRE_SRC} ${ENSEMBLE}
mkdir -p ${dir $@}
grep . $< > $@.input
${LOADMODS} && ${MARIAN}/marian-decoder -i $@.input \
--models ${ENSEMBLE} \
--vocabs ${WORKDIR}/${MODEL}.vocab.yml \
${WORKDIR}/${MODEL}.vocab.yml \
${WORKDIR}/${MODEL}.vocab.yml \
${MARIAN_DECODER_FLAGS} > $@.output
ifeq (${PRE_TRG},spm${TRGBPESIZE:000=}k)
sed 's/ //g;s/▁/ /g' < $@.output | sed 's/^ *//;s/ *$$//' > $@
else
sed 's/\@\@ //g;s/ \@\@//g;s/ \@\-\@ /-/g' < $@.output |\
$(TOKENIZER)/detokenizer.perl -l ${TRG} > $@
endif
rm -f $@.input $@.output
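The two `sed` passes above undo SentencePiece segmentation: the first deletes the spaces between subword tokens and turns the `▁` word-boundary markers back into spaces, the second trims the line edges. A stand-alone sketch with a made-up segmented line:

```shell
# Hypothetical SentencePiece output; "▁" (U+2581) marks original word boundaries.
spm='▁This ▁is ▁a ▁test .'
printf '%s\n' "$spm" | sed 's/ //g;s/▁/ /g' | sed 's/^ *//;s/ *$//'
# prints "This is a test."
```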
#------------------------------------------------------------------------
# translate, evaluate and generate a file
# for comparing system to reference translations
#------------------------------------------------------------------------
${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}: ${TEST_SRC}.${PRE_SRC} ${MODEL_FINAL}
mkdir -p ${dir $@}
grep . $< > $@.input
${LOADMODS} && ${MARIAN}/marian-decoder -i $@.input \
-c ${word 2,$^}.decoder.yml \
-d ${MARIAN_GPUS} \
${MARIAN_DECODER_FLAGS} > $@.output
ifeq (${PRE_TRG},spm${TRGBPESIZE:000=}k)
sed 's/ //g;s/▁/ /g' < $@.output | sed 's/^ *//;s/ *$$//' > $@
else
sed 's/\@\@ //g;s/ \@\@//g;s/ \@\-\@ /-/g' < $@.output |\
$(TOKENIZER)/detokenizer.perl -l ${TRG} > $@
endif
rm -f $@.input $@.output
# %.eval: % ${TEST_TRG}
# grep . ${TEST_TRG} > $@.ref
# grep . $< > $@.sys
# cat $@.sys | sacrebleu $@.ref > $@
# cat $@.sys | sacrebleu --metrics=chrf --width=3 $@.ref >> $@
# rm -f $@.ref $@.sys
%.eval: % ${TEST_TRG}
paste ${TEST_SRC}.${PRE_SRC} ${TEST_TRG} | grep $$'.\t' | cut -f2 > $@.ref
cat $< | sacrebleu $@.ref > $@
cat $< | sacrebleu --metrics=chrf --width=3 $@.ref >> $@
rm -f $@.ref
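The `paste ... | grep $'.\t' | cut -f2` pipeline keeps only those reference lines whose source line is non-empty, mirroring the `grep .` applied to the decoder input, so system and reference stay line-aligned. A sketch with tiny hypothetical files:

```shell
# Hypothetical test files: the empty source line 2 must be dropped on the
# reference side too, mirroring the `grep .` applied to the decoder input.
printf 'src1\n\nsrc3\n' > src.txt
printf 'ref1\nref2\nref3\n' > ref.txt
paste src.txt ref.txt | grep $'.\t' | cut -f2   # prints "ref1" and "ref3"
rm -f src.txt ref.txt
```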
%.compare: %.eval
paste -d "\n" ${TEST_SRC} ${TEST_TRG} ${<:.eval=} |\
sed -e "s/&apos;/'/g" \
-e 's/&quot;/"/g' \
-e 's/&lt;/</g' \
-e 's/&gt;/>/g' \
-e 's/&amp;/&/g' |\
sed 'n;n;G;' > $@

##----------------------------------------------------------------------------
## Makefile.config
##----------------------------------------------------------------------------
# -*-makefile-*-
#
# model configurations
#
# SRCLANGS = da no sv
# TRGLANGS = fi
SRCLANGS = sv
TRGLANGS = fi
ifndef SRC
SRC := ${firstword ${SRCLANGS}}
endif
ifndef TRG
TRG := ${lastword ${TRGLANGS}}
endif
# sorted languages and langpair used to match resources in OPUS
SORTLANGS = $(sort ${SRC} ${TRG})
SPACE = $(empty) $(empty)
LANGPAIR = ${firstword ${SORTLANGS}}-${lastword ${SORTLANGS}}
LANGSTR = ${subst ${SPACE},+,$(SRCLANGS)}-${subst ${SPACE},+,$(TRGLANGS)}
## for same language pairs: add numeric extension
ifeq (${SRC},$(TRG))
SRCEXT = ${SRC}1
TRGEXT = ${SRC}2
else
SRCEXT = ${SRC}
TRGEXT = ${TRG}
endif
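`SORTLANGS` orders the two codes alphabetically so that both translation directions map to the single pair name under which OPUS stores the corpus (e.g. `sv`/`fi` and `fi`/`sv` both give `fi-sv`). The same normalisation in plain shell:

```shell
# Hypothetical direction sv->fi; OPUS stores the corpus under "fi-sv".
src=sv; trg=fi
printf '%s\n%s\n' "$src" "$trg" | sort | paste -sd- -   # prints "fi-sv"
```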
## all of OPUS (NEW: don't require MOSES format)
# OPUSCORPORA = ${patsubst %/latest/moses/${LANGPAIR}.txt.zip,%,\
# ${patsubst ${OPUSHOME}/%,%,\
# ${shell ls ${OPUSHOME}/*/latest/moses/${LANGPAIR}.txt.zip}}}
OPUSCORPORA = ${patsubst %/latest/xml/${LANGPAIR}.xml.gz,%,\
${patsubst ${OPUSHOME}/%,%,\
${shell ls ${OPUSHOME}/*/latest/xml/${LANGPAIR}.xml.gz}}}
ALL_LANG_PAIRS = ${shell ls ${WORKHOME} | grep -- '-' | grep -v old}
ALL_BILINGUAL_MODELS = ${shell ls ${WORKHOME} | grep -- '-' | grep -v old | grep -v -- '\+'}
ALL_MULTILINGUAL_MODELS = ${shell ls ${WORKHOME} | grep -- '-' | grep -v old | grep -- '\+'}
## size of dev data and test data
DEVSIZE = 5000
TESTSIZE = 5000
## NEW: significantly reduce devminsize
## (= absolute minimum we need as devdata)
## NEW: define an alternative small size for DEV and TEST
## OLD DEVMINSIZE:
# DEVMINSIZE = 1000
DEVSMALLSIZE = 1000
TESTSMALLSIZE = 1000
DEVMINSIZE = 250
## size of heldout data for each sub-corpus
## (only if there are at least twice as many examples in the corpus)
HELDOUTSIZE = ${DEVSIZE}
##----------------------------------------------------------------------------
## train/dev/test data
##----------------------------------------------------------------------------
## dev/test data: default = Tatoeba; otherwise GlobalVoices, JW300 or bible-uedin
## - check that data exist
## - check that there are at least 2 x DEVMINSIZE examples
ifneq ($(wildcard ${OPUSHOME}/Tatoeba/latest/moses/${LANGPAIR}.txt.zip),)
ifeq ($(shell if (( `head -1 ${OPUSHOME}/Tatoeba/latest/info/${LANGPAIR}.txt.info` \
> $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then echo "ok"; fi),ok)
DEVSET = Tatoeba
endif
endif
## backoff to GlobalVoices
ifndef DEVSET
ifneq ($(wildcard ${OPUSHOME}/GlobalVoices/latest/moses/${LANGPAIR}.txt.zip),)
ifeq ($(shell if (( `head -1 ${OPUSHOME}/GlobalVoices/latest/info/${LANGPAIR}.txt.info` \
> $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then echo "ok"; fi),ok)
DEVSET = GlobalVoices
endif
endif
endif
## backoff to JW300
ifndef DEVSET
ifneq ($(wildcard ${OPUSHOME}/JW300/latest/xml/${LANGPAIR}.xml.gz),)
ifeq ($(shell if (( `sed -n 2p ${OPUSHOME}/JW300/latest/info/${LANGPAIR}.info` \
> $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then echo "ok"; fi),ok)
DEVSET = JW300
endif
endif
endif
## otherwise: bible-uedin
ifndef DEVSET
DEVSET = bible-uedin
endif
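Each step of the backoff chain reads the sentence-pair count from the corpus info file and accepts the corpus only if it holds more than 2 x DEVMINSIZE pairs. The check in isolation, with a hypothetical info file:

```shell
# Hypothetical info file: assume the first line holds the sentence-pair count.
DEVMINSIZE=250
echo "600" > corpus.info
if (( $(head -1 corpus.info) > DEVMINSIZE + DEVMINSIZE )); then
    echo "large enough for dev data"
fi
rm -f corpus.info
```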
## in case we want to use some additional data sets
EXTRA_TRAINSET =
## TESTSET = DEVSET; TRAINSET = all OPUS corpora minus WMT-News, DEVSET and TESTSET
TESTSET = ${DEVSET}
TRAINSET = $(filter-out WMT-News ${DEVSET} ${TESTSET},${OPUSCORPORA} ${EXTRA_TRAINSET})
TUNESET = OpenSubtitles
## 1 = use remaining data from dev/test data for training
USE_REST_DEVDATA = 1
##----------------------------------------------------------------------------
## pre-processing and vocabulary
##----------------------------------------------------------------------------
BPESIZE = 32000
SRCBPESIZE = ${BPESIZE}
TRGBPESIZE = ${BPESIZE}
ifndef VOCABSIZE
VOCABSIZE = $$((${SRCBPESIZE} + ${TRGBPESIZE} + 1000))
endif
## for document-level models
CONTEXT_SIZE = 100
## pre-processing type
PRE = norm
PRE_SRC = spm${SRCBPESIZE:000=}k
PRE_TRG = spm${TRGBPESIZE:000=}k
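`${SRCBPESIZE:000=}k` is a GNU make substitution reference: it replaces a trailing `000` with nothing, and the literal `k` is appended afterwards, so 32000 yields `spm32k`. The equivalent shell transformation:

```shell
# Hypothetical BPE size; make's spm${SRCBPESIZE:000=}k turns 32000 into "spm32k".
size=32000
printf 'spm%sk\n' "${size%000}"   # prints "spm32k"
```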
##-------------------------------------
## name of the data set (and the model)
## - single corpus = use that name
## - multiple corpora = opus
## add also vocab size to the name
##-------------------------------------
ifndef DATASET
ifeq (${words ${TRAINSET}},1)
DATASET = ${TRAINSET}
else
DATASET = opus
endif
endif
## DATADIR = directory where the train/dev/test data are
## WORKDIR = directory used for training
DATADIR = ${WORKHOME}/data
WORKDIR = ${WORKHOME}/${LANGSTR}
## data sets
TRAIN_BASE = ${WORKDIR}/train/${DATASET}
TRAIN_SRC = ${TRAIN_BASE}.src
TRAIN_TRG = ${TRAIN_BASE}.trg
TRAIN_ALG = ${TRAIN_BASE}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}.src-trg.alg.gz
## training data in local space
LOCAL_TRAIN_SRC = ${TMPDIR}/${LANGSTR}/train/${DATASET}.src
LOCAL_TRAIN_TRG = ${TMPDIR}/${LANGSTR}/train/${DATASET}.trg
TUNE_SRC = ${WORKDIR}/tune/${TUNESET}.src
TUNE_TRG = ${WORKDIR}/tune/${TUNESET}.trg
DEV_SRC = ${WORKDIR}/val/${DEVSET}.src
DEV_TRG = ${WORKDIR}/val/${DEVSET}.trg
TEST_SRC = ${WORKDIR}/test/${TESTSET}.src
TEST_TRG = ${WORKDIR}/test/${TESTSET}.trg
## heldout data directory (keep one set per data set)
HELDOUT_DIR = ${WORKDIR}/heldout
MODEL_SUBDIR =
MODEL = ${MODEL_SUBDIR}${DATASET}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}
MODELTYPE = transformer-align
NR = 1
MODEL_BASENAME = ${MODEL}.${MODELTYPE}.model${NR}
MODEL_VALIDLOG = ${MODEL}.${MODELTYPE}.valid${NR}.log
MODEL_TRAINLOG = ${MODEL}.${MODELTYPE}.train${NR}.log
MODEL_FINAL = ${WORKDIR}/${MODEL_BASENAME}.npz.best-perplexity.npz
MODEL_VOCABTYPE = yml
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab.${MODEL_VOCABTYPE}
MODEL_DECODER = ${MODEL_FINAL}.decoder.yml
## test set translation and scores
TEST_TRANSLATION = ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}
TEST_EVALUATION = ${TEST_TRANSLATION}.eval
TEST_COMPARISON = ${TEST_TRANSLATION}.compare
## parameters for running Marian NMT
MARIAN_GPUS = 0
MARIAN_EXTRA =
MARIAN_VALID_FREQ = 10000
MARIAN_SAVE_FREQ = ${MARIAN_VALID_FREQ}
MARIAN_DISP_FREQ = ${MARIAN_VALID_FREQ}
MARIAN_EARLY_STOPPING = 10
MARIAN_VALID_MINI_BATCH = 16
MARIAN_MAXI_BATCH = 500
MARIAN_DROPOUT = 0.1
MARIAN_DECODER_GPU = -b 12 -n1 -d ${MARIAN_GPUS} --mini-batch 8 --maxi-batch 32 --maxi-batch-sort src
MARIAN_DECODER_CPU = -b 12 -n1 --cpu-threads ${HPC_CORES} --mini-batch 8 --maxi-batch 32 --maxi-batch-sort src
MARIAN_DECODER_FLAGS = ${MARIAN_DECODER_GPU}
## TODO: currently Marian NMT crashes with workspace > 26000
ifeq (${GPU},p100)
MARIAN_WORKSPACE = 13000
else ifeq (${GPU},v100)
# MARIAN_WORKSPACE = 30000
# MARIAN_WORKSPACE = 26000
# MARIAN_WORKSPACE = 24000
# MARIAN_WORKSPACE = 18000
MARIAN_WORKSPACE = 16000
else
MARIAN_WORKSPACE = 10000
endif
ifeq (${shell nvidia-smi | grep failed | wc -l},1)
MARIAN = ${MARIANCPU}
MARIAN_DECODER_FLAGS = ${MARIAN_DECODER_CPU}
MARIAN_EXTRA = --cpu-threads ${HPC_CORES}
endif
ifneq ("$(wildcard ${TRAIN_WEIGHTS})","")
MARIAN_TRAIN_WEIGHTS = --data-weighting ${TRAIN_WEIGHTS}
endif
### training a model with Marian NMT
##
## NR makes it possible to train several models for proper ensembling
## (with shared vocab)
##
## DANGER: if several models are started at the same time
## then there is a race condition when creating the vocab!
ifdef NR
SEED=${NR}${NR}${NR}${NR}
else
SEED=1234
endif

##----------------------------------------------------------------------------
## Makefile.data
##----------------------------------------------------------------------------
# -*-makefile-*-
ifndef SRCLANGS
SRCLANGS=${SRC}
endif
ifndef TRGLANGS
TRGLANGS=${TRG}
endif
ifndef THREADS
THREADS=${HPC_CORES}
endif
CLEAN_TRAIN_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${TRAINSET}}
CLEAN_TRAIN_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_TRAIN_SRC}}
CLEAN_TUNE_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${TUNESET}}
CLEAN_TUNE_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_TUNE_SRC}}
CLEAN_DEV_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${DEVSET}}
CLEAN_DEV_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_DEV_SRC}}
CLEAN_TEST_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${TESTSET}}
CLEAN_TEST_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_TEST_SRC}}
DATA_SRC := ${sort ${CLEAN_TRAIN_SRC} ${CLEAN_TUNE_SRC} ${CLEAN_DEV_SRC} ${CLEAN_TEST_SRC}}
DATA_TRG := ${sort ${CLEAN_TRAIN_TRG} ${CLEAN_TUNE_TRG} ${CLEAN_DEV_TRG} ${CLEAN_TEST_TRG}}
## make data in reverse direction without re-doing word alignment etc ...
## ---> this is dangerous when things run in parallel
## ---> only works for bilingual models
REV_LANGSTR = ${subst ${SPACE},+,$(TRGLANGS)}-${subst ${SPACE},+,$(SRCLANGS)}
REV_WORKDIR = ${WORKHOME}/${REV_LANGSTR}
reverse-data:
ifeq (${PRE_SRC},${PRE_TRG})
ifeq (${words ${SRCLANGS}},1)
ifeq (${words ${TRGLANGS}},1)
-if [ -e ${TRAIN_SRC}.clean.${PRE_SRC}.gz ]; then \
mkdir -p ${REV_WORKDIR}/train; \
ln -s ${TRAIN_SRC}.clean.${PRE_SRC}.gz ${REV_WORKDIR}/train/${notdir ${TRAIN_TRG}.clean.${PRE_TRG}.gz}; \
ln -s ${TRAIN_TRG}.clean.${PRE_TRG}.gz ${REV_WORKDIR}/train/${notdir ${TRAIN_SRC}.clean.${PRE_SRC}.gz}; \
fi
-if [ -e ${SPMSRCMODEL} ]; then \
ln -s ${SPMSRCMODEL} ${REV_WORKDIR}/train/${notdir ${SPMTRGMODEL}}; \
ln -s ${SPMTRGMODEL} ${REV_WORKDIR}/train/${notdir ${SPMSRCMODEL}}; \
fi
if [ -e ${BPESRCMODEL} ]; then \
ln -s ${BPESRCMODEL} ${REV_WORKDIR}/train/${notdir ${BPETRGMODEL}}; \
ln -s ${BPETRGMODEL} ${REV_WORKDIR}/train/${notdir ${BPESRCMODEL}}; \
fi
-if [ -e ${TRAIN_ALG} ]; then \
if [ ! -e ${REV_WORKDIR}/train/${notdir ${TRAIN_ALG}} ]; then \
zcat ${TRAIN_ALG} | ${MOSESSCRIPTS}/generic/reverse-alignment.perl |\
gzip -c > ${REV_WORKDIR}/train/${notdir ${TRAIN_ALG}}; \
fi \
fi
-if [ -e ${DEV_SRC}.${PRE_SRC} ]; then \
mkdir -p ${REV_WORKDIR}/val; \
ln -s ${DEV_SRC}.${PRE_SRC} ${REV_WORKDIR}/val/${notdir ${DEV_TRG}.${PRE_TRG}}; \
ln -s ${DEV_TRG}.${PRE_TRG} ${REV_WORKDIR}/val/${notdir ${DEV_SRC}.${PRE_SRC}}; \
ln -s ${DEV_SRC} ${REV_WORKDIR}/val/${notdir ${DEV_TRG}}; \
ln -s ${DEV_TRG} ${REV_WORKDIR}/val/${notdir ${DEV_SRC}}; \
ln -s ${DEV_SRC}.shuffled.gz ${REV_WORKDIR}/val/${notdir ${DEV_SRC}.shuffled.gz}; \
fi
-if [ -e ${TEST_SRC} ]; then \
mkdir -p ${REV_WORKDIR}/test; \
ln -s ${TEST_SRC} ${REV_WORKDIR}/test/${notdir ${TEST_TRG}}; \
ln -s ${TEST_TRG} ${REV_WORKDIR}/test/${notdir ${TEST_SRC}}; \
fi
-if [ -e ${MODEL_VOCAB} ]; then \
ln -s ${MODEL_VOCAB} ${REV_WORKDIR}/${notdir ${MODEL_VOCAB}}; \
fi
endif
endif
endif
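`reverse-alignment.perl` swaps the source and target indices of every link in the Pharaoh `i-j` alignment format, so an existing word alignment can be reused for the reverse translation direction. A minimal stand-in with `sed` (assuming plain `i-j` links):

```shell
# Hypothetical alignment line in Pharaoh format: "srcIdx-trgIdx" links.
echo '0-0 1-2 2-1 3-3' | sed -E 's/([0-9]+)-([0-9]+)/\2-\1/g'
# prints "0-0 2-1 1-2 3-3"
```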
clean-data:
for s in ${SRCLANGS}; do \
for t in ${TRGLANGS}; do \
${MAKE} SRC=$$s TRG=$$t clean-data-source; \
done \
done
clean-data-source: ${DATA_SRC} ${DATA_TRG}
## word alignment used for guided alignment
.INTERMEDIATE: ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
${LOCAL_TRAIN_SRC}.algtmp: ${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz
mkdir -p ${dir $@}
gzip -cd < $< > $@
${LOCAL_TRAIN_TRG}.algtmp: ${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz
mkdir -p ${dir $@}
gzip -cd < $< > $@
## max number of lines in a corpus for running word alignment
## (split into chunks of max that size before aligning)
MAX_WORDALIGN_SIZE = 5000000
# MAX_WORDALIGN_SIZE = 10000000
# MAX_WORDALIGN_SIZE = 25000000
${TRAIN_ALG}: ${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz
${MAKE} ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
if [ `head $(LOCAL_TRAIN_SRC).algtmp | wc -l` -gt 0 ]; then \
mkdir -p $(LOCAL_TRAIN_SRC).algtmp.d; \
mkdir -p $(LOCAL_TRAIN_TRG).algtmp.d; \
split -l ${MAX_WORDALIGN_SIZE} $(LOCAL_TRAIN_SRC).algtmp $(LOCAL_TRAIN_SRC).algtmp.d/; \
split -l ${MAX_WORDALIGN_SIZE} $(LOCAL_TRAIN_TRG).algtmp $(LOCAL_TRAIN_TRG).algtmp.d/; \
for s in `ls $(LOCAL_TRAIN_SRC).algtmp.d`; do \
echo "align part $$s"; \
${WORDALIGN} --overwrite \
-s $(LOCAL_TRAIN_SRC).algtmp.d/$$s \
-t $(LOCAL_TRAIN_TRG).algtmp.d/$$s \
-f $(LOCAL_TRAIN_SRC).algtmp.d/$$s.fwd \
-r $(LOCAL_TRAIN_TRG).algtmp.d/$$s.rev; \
done; \
echo "merge and symmetrize"; \
cat $(LOCAL_TRAIN_SRC).algtmp.d/*.fwd > $(LOCAL_TRAIN_SRC).fwd; \
cat $(LOCAL_TRAIN_TRG).algtmp.d/*.rev > $(LOCAL_TRAIN_TRG).rev; \
${ATOOLS} -c grow-diag-final -i $(LOCAL_TRAIN_SRC).fwd -j $(LOCAL_TRAIN_TRG).rev |\
gzip -c > $@; \
rm -f ${LOCAL_TRAIN_SRC}.algtmp.d/*; \
rm -f ${LOCAL_TRAIN_TRG}.algtmp.d/*; \
rmdir ${LOCAL_TRAIN_SRC}.algtmp.d; \
rmdir ${LOCAL_TRAIN_TRG}.algtmp.d; \
rm -f $(LOCAL_TRAIN_SRC).fwd $(LOCAL_TRAIN_TRG).rev; \
fi
rm -f ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
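The recipe above bounds the aligner's memory use by splitting the corpus into chunks of at most MAX_WORDALIGN_SIZE lines, aligning each chunk separately, and concatenating the per-chunk outputs in their original order. The split/merge round trip itself is lossless, which is what makes the chunking safe; a sketch with a dummy aligner (`cat` stands in for ${WORDALIGN}):

```shell
# Dummy corpus and a dummy "aligner" (cat) to show the chunking pattern.
seq 1 10 > corpus.txt
mkdir -p chunks.d
split -l 4 corpus.txt chunks.d/                    # at most 4 lines per chunk
for f in chunks.d/*; do cat "$f" > "$f.out"; done  # stand-in for ${WORDALIGN}
cat chunks.d/*.out > merged.txt                    # glob order = split order
cmp corpus.txt merged.txt && echo "round trip ok"
rm -rf chunks.d corpus.txt merged.txt
```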
## old way of word alignment with all the data in one process
## --> this may take a long time for very large corpora
## --> may also take a lot of memory (split instead, see above)
# ${TRAIN_ALG}: ${TRAIN_SRC}.${PRE_SRC}${TRAINSIZE}.gz \
# ${TRAIN_TRG}.${PRE_TRG}${TRAINSIZE}.gz
# ${MAKE} ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
# if [ `head $(LOCAL_TRAIN_SRC).algtmp | wc -l` -gt 0 ]; then \
# ${WORDALIGN} -s $(LOCAL_TRAIN_SRC).algtmp -t $(LOCAL_TRAIN_TRG).algtmp \
# --overwrite -f $(LOCAL_TRAIN_SRC).fwd -r $(LOCAL_TRAIN_TRG).rev; \
# ${ATOOLS} -c grow-diag-final -i $(LOCAL_TRAIN_SRC).fwd -j $(LOCAL_TRAIN_TRG).rev |\
# gzip -c > $@; \
# fi
# rm -f ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
# rm -f $(LOCAL_TRAIN_SRC).fwd $(LOCAL_TRAIN_TRG).rev
## copy OPUS data
## (check that the OPUS file really exists! if not, create an empty file)
##
## TODO: should we read all data from scratch using opus_read?
## - also: langid filtering and link prob filtering?
%.${SRCEXT}.raw:
mkdir -p ${dir $@}
c=${patsubst %.${LANGPAIR}.${SRCEXT}.raw,%,${notdir $@}}; \
if [ -e ${OPUSHOME}/$$c/latest/moses/${LANGPAIR}.txt.zip ]; then \
scp ${OPUSHOME}/$$c/latest/moses/${LANGPAIR}.txt.zip $@.zip; \
unzip -d ${dir $@} $@.zip -x README LICENSE; \
mv ${dir $@}$$c*.${LANGPAIR}.${SRCEXT} $@; \
mv ${dir $@}$$c*.${LANGPAIR}.${TRGEXT} \
${@:.${SRCEXT}.raw=.${TRGEXT}.raw}; \
rm -f $@.zip ${@:.${SRCEXT}.raw=.xml} ${@:.${SRCEXT}.raw=.ids} ${dir $@}/README ${dir $@}/LICENSE; \
elif [ -e ${OPUSHOME}/$$c/latest/xml/${LANGPAIR}.xml.gz ]; then \
echo "extract $$c (${LANGPAIR}) from OPUS"; \
opus_read -rd ${OPUSHOME} -d $$c -s ${SRC} -t ${TRG} -wm moses -p raw > $@.tmp; \
cut -f1 $@.tmp > $@; \
cut -f2 $@.tmp > ${@:.${SRCEXT}.raw=.${TRGEXT}.raw}; \
rm -f $@.tmp; \
else \
touch $@; \
touch ${@:.${SRCEXT}.raw=.${TRGEXT}.raw}; \
fi
%.${TRGEXT}.raw: %.${SRCEXT}.raw
@echo "done!"
## clean data
## OLD: apply cleanup script from Moses
## --> this might not be a good idea before subword splitting for languages without spaces
## NEW: do this later after splitting into subword units
##
## TODO:
## - does this affect SentencePiece / BPE models in some negative way?
## - should we increase the length filter when cleaning later? How much?
## - should we apply some other cleanup scripts here to get rid of some messy stuff?
# ## this is too strict for non-latin languages
# # grep -i '[a-zäöå0-9]' |\
## OLD:
##
# %.clean.${SRCEXT}.gz: %.${SRCEXT}.${PRE} %.${TRGEXT}.${PRE}
# rm -f $@.${SRCEXT} $@.${TRGEXT}
# ln -s ${word 1,$^} $@.${SRCEXT}
# ln -s ${word 2,$^} $@.${TRGEXT}
# $(MOSESSCRIPTS)/training/clean-corpus-n.perl $@ $(SRCEXT) $(TRGEXT) ${@:.${SRCEXT}.gz=} 0 100
# rm -f $@.${SRCEXT} $@.${TRGEXT}
# paste ${@:.gz=} ${@:.${SRCEXT}.gz=.${TRGEXT}} |\
# perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' > $@.tmp
# rm -f ${@:.gz=} ${@:.${SRCEXT}.gz=.${TRGEXT}}
# cut -f1 $@.tmp | gzip -c > $@
# cut -f2 $@.tmp | gzip -c > ${@:.${SRCEXT}.gz=.${TRGEXT}.gz}
# rm -f $@.tmp
# %.clean.${TRGEXT}.gz: %.clean.${SRCEXT}.gz
# @echo "done!"
%.clean.${SRCEXT}.gz: %.${SRCEXT}.${PRE} %.${TRGEXT}.${PRE}
cat $< |\
perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
gzip -c > $@
%.clean.${TRGEXT}.gz: %.${TRGEXT}.${PRE}
cat $< |\
perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
gzip -c > $@
## add training data for each language combination
## and put it together in local space
${LOCAL_TRAIN_SRC}: ${DEV_SRC} ${DEV_TRG}
mkdir -p ${dir $@}
rm -f ${LOCAL_TRAIN_SRC} ${LOCAL_TRAIN_TRG}
-for s in ${SRCLANGS}; do \
for t in ${TRGLANGS}; do \
if [ ${HELDOUTSIZE} -gt 0 ]; then \
${MAKE} DATASET=${DATASET} SRC:=$$s TRG:=$$t \
add-to-local-train-and-heldout-data; \
else \
${MAKE} DATASET=${DATASET} SRC:=$$s TRG:=$$t \
add-to-local-train-data; \
fi \
done \
done
ifeq (${USE_REST_DEVDATA},1)
if [ -e ${DEV_SRC}.notused.gz ]; then \
zcat ${DEV_SRC}.notused.gz >> ${LOCAL_TRAIN_SRC}; \
zcat ${DEV_TRG}.notused.gz >> ${LOCAL_TRAIN_TRG}; \
fi
endif
# ${MAKE} DATASET=${DATASET} SRC:=$$s TRG:=$$t add-to-local-train-data; \
${LOCAL_TRAIN_TRG}: ${LOCAL_TRAIN_SRC}
@echo "done!"
## add to the training data
add-to-local-train-data: ${CLEAN_TRAIN_SRC} ${CLEAN_TRAIN_TRG}
ifneq (${CLEAN_TRAIN_SRC},)
echo "${CLEAN_TRAIN_SRC}" >> ${dir ${LOCAL_TRAIN_SRC}}/README
ifneq (${words ${TRGLANGS}},1)
echo "more than one target language";
zcat ${CLEAN_TRAIN_SRC} |\
sed "s/^/>>${TRG}<< /" >> ${LOCAL_TRAIN_SRC}
else
echo "only one target language"
zcat ${CLEAN_TRAIN_SRC} >> ${LOCAL_TRAIN_SRC}
endif
zcat ${CLEAN_TRAIN_TRG} >> ${LOCAL_TRAIN_TRG}
endif
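With more than one target language, every source sentence is prefixed with a `>>xx<<` token that tells the multilingual model which language to generate. The tagging is a single sed substitution:

```shell
# Hypothetical source sentence tagged for Finnish output (TRG=fi).
echo 'a sentence to translate' | sed 's/^/>>fi<< /'
# prints ">>fi<< a sentence to translate"
```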
## extract training data but keep some heldout data for each dataset
add-to-local-train-and-heldout-data: ${CLEAN_TRAIN_SRC} ${CLEAN_TRAIN_TRG}
ifneq (${CLEAN_TRAIN_SRC},)
echo "${CLEAN_TRAIN_SRC}" >> ${dir ${LOCAL_TRAIN_SRC}}/README
mkdir -p ${HELDOUT_DIR}/${SRC}-${TRG}
ifneq (${words ${TRGLANGS}},1)
echo "more than one target language";
for c in ${CLEAN_TRAIN_SRC}; do \
if (( `zcat $$c | head -$$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) | wc -l` == $$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) )); then \
zcat $$c | tail -n +$$(($(HELDOUTSIZE) + 1)) |\
sed "s/^/>>${TRG}<< /" >> ${LOCAL_TRAIN_SRC}; \
zcat $$c | head -$(HELDOUTSIZE) |\
sed "s/^/>>${TRG}<< /" | gzip -c \
> ${HELDOUT_DIR}/${SRC}-${TRG}/`basename $$c`; \
else \
zcat $$c | sed "s/^/>>${TRG}<< /" >> ${LOCAL_TRAIN_SRC}; \
fi \
done
else
echo "only one target language"
for c in ${CLEAN_TRAIN_SRC}; do \
if (( `zcat $$c | head -$$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) | wc -l` == $$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) )); then \
zcat $$c | tail -n +$$(($(HELDOUTSIZE) + 1)) >> ${LOCAL_TRAIN_SRC}; \
zcat $$c | head -$(HELDOUTSIZE) |\
gzip -c > ${HELDOUT_DIR}/${SRC}-${TRG}/`basename $$c`; \
else \
zcat $$c >> ${LOCAL_TRAIN_SRC}; \
fi \
done
endif
for c in ${CLEAN_TRAIN_TRG}; do \
if (( `zcat $$c | head -$$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) | wc -l` == $$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) )); then \
zcat $$c | tail -n +$$(($(HELDOUTSIZE) + 1)) >> ${LOCAL_TRAIN_TRG}; \
zcat $$c | head -$(HELDOUTSIZE) |\
gzip -c > ${HELDOUT_DIR}/${SRC}-${TRG}/`basename $$c`; \
else \
zcat $$c >> ${LOCAL_TRAIN_TRG}; \
fi \
done
endif
####################
# development data
####################
${DEV_SRC}.shuffled.gz:
mkdir -p ${dir $@}
rm -f ${DEV_SRC} ${DEV_TRG}
-for s in ${SRCLANGS}; do \
for t in ${TRGLANGS}; do \
${MAKE} SRC=$$s TRG=$$t add-to-dev-data; \
done \
done
paste ${DEV_SRC} ${DEV_TRG} | shuf | gzip -c > $@
## if we have less than twice the amount of DEVMINSIZE in the data set
## --> extract some data from the training data to be used as devdata
${DEV_SRC}: %: %.shuffled.gz
## if we extract test and dev data from the same data set
## ---> make sure that we do not have any overlap between the two data sets
## ---> reserve at least DEVMINSIZE data for dev data and keep the rest for testing
ifeq (${DEVSET},${TESTSET})
if (( `zcat $@.shuffled.gz | wc -l` < $$((${DEVSIZE} + ${TESTSIZE})) )); then \
if (( `zcat $@.shuffled.gz | wc -l` < $$((${DEVSMALLSIZE} + ${DEVMINSIZE})) )); then \
echo "devset = top ${DEVMINSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README; \
zcat $@.shuffled.gz | cut -f1 | head -${DEVMINSIZE} > ${DEV_SRC}; \
zcat $@.shuffled.gz | cut -f2 | head -${DEVMINSIZE} > ${DEV_TRG}; \
mkdir -p ${dir ${TEST_SRC}}; \
echo "testset = remaining lines after the top ${DEVMINSIZE} of ../val/${notdir $@}.shuffled!" >> ${dir ${TEST_SRC}}/README; \
zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_SRC}; \
zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_TRG}; \
else \
echo "devset = top ${DEVSMALLSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README; \
zcat $@.shuffled.gz | cut -f1 | head -${DEVSMALLSIZE} > ${DEV_SRC}; \
zcat $@.shuffled.gz | cut -f2 | head -${DEVSMALLSIZE} > ${DEV_TRG}; \
mkdir -p ${dir ${TEST_SRC}}; \
echo "testset = remaining lines after the top ${DEVSMALLSIZE} of ../val/${notdir $@}.shuffled!" >> ${dir ${TEST_SRC}}/README; \
zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVSMALLSIZE} + 1)) > ${TEST_SRC}; \
zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVSMALLSIZE} + 1)) > ${TEST_TRG}; \
fi; \
else \
echo "devset = top ${DEVSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README; \
zcat $@.shuffled.gz | cut -f1 | head -${DEVSIZE} > ${DEV_SRC}; \
zcat $@.shuffled.gz | cut -f2 | head -${DEVSIZE} > ${DEV_TRG}; \
mkdir -p ${dir ${TEST_SRC}}; \
echo "testset = next ${TESTSIZE} lines after the top ${DEVSIZE} of ../val/${notdir $@}.shuffled!" >> ${dir ${TEST_SRC}}/README; \
zcat $@.shuffled.gz | cut -f1 | head -$$((${DEVSIZE} + ${TESTSIZE})) | tail -${TESTSIZE} > ${TEST_SRC}; \
zcat $@.shuffled.gz | cut -f2 | head -$$((${DEVSIZE} + ${TESTSIZE})) | tail -${TESTSIZE} > ${TEST_TRG}; \
zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVSIZE} + ${TESTSIZE})) | gzip -c > ${DEV_SRC}.notused.gz; \
zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVSIZE} + ${TESTSIZE})) | gzip -c > ${DEV_TRG}.notused.gz; \
fi
else
echo "devset = top ${DEVSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README
zcat $@.shuffled.gz | cut -f1 | head -${DEVSIZE} > ${DEV_SRC}
zcat $@.shuffled.gz | cut -f2 | head -${DEVSIZE} > ${DEV_TRG}
zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVSIZE} + 1)) | gzip -c > ${DEV_SRC}.notused.gz
zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVSIZE} + 1)) | gzip -c > ${DEV_TRG}.notused.gz
endif
# zcat $@.shuffled.gz | cut -f1 | tail -${TESTSIZE} > ${TEST_SRC}; \
# zcat $@.shuffled.gz | cut -f2 | tail -${TESTSIZE} > ${TEST_TRG}; \
${DEV_TRG}: ${DEV_SRC}
@echo "done!"
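`paste | shuf` shuffles the bitext jointly: each source line stays glued to its target line inside one tab-separated record, and cutting fields 1 and 2 afterwards yields parallel files in the same shuffled order. A quick check that no pair is broken:

```shell
# A 5-line dummy bitext: shuffle jointly, then verify that every pair survives.
seq 1 5 | sed 's/^/s/' > src.txt
seq 1 5 | sed 's/^/t/' > trg.txt
paste src.txt trg.txt | shuf > shuffled.tsv
# pairing is intact iff the sorted pair lists are identical
cmp <(paste src.txt trg.txt | sort) <(sort shuffled.tsv) && echo "pairs intact"
rm -f src.txt trg.txt shuffled.tsv
```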
### OLD: extract data from training data as dev/test set if the devdata is too small
### ---> this is confusing - skip this
###
### otherwise copy this directly after the target for ${DEV_SRC} above!
### and add dependency on train-data for ${DEV_SRC}.shuffled.gz like this:
### ${DEV_SRC}.shuffled.gz: ${TRAIN_SRC}.${PRE_SRC}.gz ${TRAIN_TRG}.${PRE_TRG}.gz
### and remove dependency on dev-data for ${LOCAL_TRAIN_SRC}, change
### ${LOCAL_TRAIN_SRC}: ${DEV_SRC} ${DEV_TRG} to
### ${LOCAL_TRAIN_SRC}:
#
# if (( `zcat $@.shuffled.gz | wc -l` < $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then \
# echo "Need more devdata - take some from traindata!"; \
# echo ".......... (1) extract top $$((${DEVSIZE} + ${TESTSIZE})) lines"; \
# echo "Too little dev/test data in ${DEVSET}!" >> ${dir $@}/README; \
# echo "Add top $$((${DEVSIZE} + ${TESTSIZE})) lines from ${DATASET} to dev/test" >> ${dir $@}/README; \
# echo "and remove those lines from training data" >> ${dir $@}/README; \
# zcat ${TRAIN_SRC}.${PRE_SRC}.gz | \
# head -$$((${DEVSIZE} + ${TESTSIZE})) | \
# sed 's/\@\@ //g' > $@.extra.${SRC}; \
# zcat ${TRAIN_TRG}.${PRE_TRG}.gz | \
# head -$$((${DEVSIZE} + ${TESTSIZE})) | \
# sed 's/\@\@ //g' > $@.extra.${TRG}; \
# echo ".......... (2) remaining lines for training"; \
# zcat ${TRAIN_SRC}.${PRE_SRC}.gz | \
# tail -n +$$((${DEVSIZE} + ${TESTSIZE} + 1)) | \
# sed 's/\@\@ //g' | gzip -c > $@.remaining.${SRC}.gz; \
# zcat ${TRAIN_TRG}.${PRE_TRG}.gz | \
# tail -n +$$((${DEVSIZE} + ${TESTSIZE} + 1)) | \
# sed 's/\@\@ //g' | gzip -c > $@.remaining.${TRG}.gz; \
# mv -f $@.remaining.${SRC}.gz ${TRAIN_SRC}.${PRE_SRC}.gz; \
# mv -f $@.remaining.${TRG}.gz ${TRAIN_TRG}.${PRE_TRG}.gz; \
# echo ".......... (3) append to devdata"; \
# mv $@.shuffled.gz $@.oldshuffled.gz; \
# paste $@.extra.${SRC} $@.extra.${TRG} > $@.shuffled; \
# zcat $@.oldshuffled.gz >> $@.shuffled; \
# rm $@.oldshuffled.gz; \
# gzip -f $@.shuffled; \
# rm -f $@.extra.${SRC} $@.extra.${TRG}; \
# fi
add-to-dev-data: ${CLEAN_DEV_SRC} ${CLEAN_DEV_TRG}
ifneq (${CLEAN_DEV_SRC},)
ifneq (${words ${TRGLANGS}},1)
echo "more than one target language";
zcat ${CLEAN_DEV_SRC} |\
sed "s/^/>>${TRG}<< /" >> ${DEV_SRC}
else
echo "only one target language"
zcat ${CLEAN_DEV_SRC} >> ${DEV_SRC}
endif
zcat ${CLEAN_DEV_TRG} >> ${DEV_TRG}
endif
####################
# test data
####################
##
## if devset and testset are from the same source:
## --> use part of the shuffled devset
## otherwise: create the testset
## exception: TESTSET exists in TESTSET_DIR
## --> just use that one
${TEST_SRC}: ${DEV_SRC}
ifneq (${TESTSET},${DEVSET})
mkdir -p ${dir $@}
rm -f ${TEST_SRC} ${TEST_TRG}
if [ -e ${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz ]; then \
${MAKE} CLEAN_TEST_SRC=${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz \
CLEAN_TEST_TRG=${TESTSET_DIR}/${TESTSET}.${TRG}.${PRE}.gz \
add-to-test-data; \
else \
for s in ${SRCLANGS}; do \
for t in ${TRGLANGS}; do \
${MAKE} SRC=$$s TRG=$$t add-to-test-data; \
done \
done; \
if [ ${TESTSIZE} -lt `cat $@ | wc -l` ]; then \
paste ${TEST_SRC} ${TEST_TRG} | shuf | gzip -c > $@.shuffled.gz; \
zcat $@.shuffled.gz | cut -f1 | tail -${TESTSIZE} > ${TEST_SRC}; \
zcat $@.shuffled.gz | cut -f2 | tail -${TESTSIZE} > ${TEST_TRG}; \
echo "testset = last ${TESTSIZE} lines of $@.shuffled!" >> ${dir $@}/README; \
fi \
fi
else
mkdir -p ${dir $@}
if [ -e ${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz ]; then \
${MAKE} CLEAN_TEST_SRC=${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz \
CLEAN_TEST_TRG=${TESTSET_DIR}/${TESTSET}.${TRG}.${PRE}.gz \
add-to-test-data; \
elif (( `zcat $<.shuffled.gz | wc -l` < $$((${DEVSIZE} + ${TESTSIZE})) )); then \
zcat $<.shuffled.gz | cut -f1 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_SRC}; \
zcat $<.shuffled.gz | cut -f2 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_TRG}; \
else \
zcat $<.shuffled.gz | cut -f1 | tail -${TESTSIZE} > ${TEST_SRC}; \
zcat $<.shuffled.gz | cut -f2 | tail -${TESTSIZE} > ${TEST_TRG}; \
fi
endif
${TEST_TRG}: ${TEST_SRC}
@echo "done!"
add-to-test-data: ${CLEAN_TEST_SRC}
ifneq (${CLEAN_TEST_SRC},)
ifneq (${words ${TRGLANGS}},1)
echo "more than one target language";
zcat ${CLEAN_TEST_SRC} |\
sed "s/^/>>${TRG}<< /" >> ${TEST_SRC}
else
echo "only one target language"
zcat ${CLEAN_TEST_SRC} >> ${TEST_SRC}
endif
zcat ${CLEAN_TEST_TRG} >> ${TEST_TRG}
endif
## reduce training data size if necessary
ifdef TRAINSIZE
${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz: ${TRAIN_SRC}.clean.${PRE_SRC}.gz
zcat $< | head -${TRAINSIZE} | gzip -c > $@
${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz: ${TRAIN_TRG}.clean.${PRE_TRG}.gz
zcat $< | head -${TRAINSIZE} | gzip -c > $@
endif
# %.clean.gz: %.gz
# mkdir -p ${TMPDIR}/${LANGSTR}/cleanup
# gzip -cd < $< > ${TMPDIR}/${LANGSTR}/cleanup/$(notdir $@).${SRCEXT}
########################
# tune data
# TODO: do we use this?
########################
${TUNE_SRC}: ${TRAIN_SRC}
mkdir -p ${dir $@}
rm -f ${TUNE_SRC} ${TUNE_TRG}
-for s in ${SRCLANGS}; do \
for t in ${TRGLANGS}; do \
${MAKE} SRC=$$s TRG=$$t add-to-tune-data; \
done \
done
${TUNE_TRG}: ${TUNE_SRC}
@echo "done!"
add-to-tune-data: ${CLEAN_TUNE_SRC}
ifneq (${CLEAN_TUNE_SRC},)
ifneq (${words ${TRGLANGS}},1)
echo "more than one target language";
zcat ${CLEAN_TUNE_SRC} |\
sed "s/^/>>${TRG}<< /" >> ${TUNE_SRC}
else
echo "only one target language"
zcat ${CLEAN_TUNE_SRC} >> ${TUNE_SRC}
endif
zcat ${CLEAN_TUNE_TRG} >> ${TUNE_TRG}
endif
##----------------------------------------------
## tokenization
##----------------------------------------------
## normalisation for Chinese
%.zh_tw.tok: %.zh_tw.raw
$(LOAD_MOSES) cat $< |\
$(TOKENIZER)/replace-unicode-punctuation.perl |\
$(TOKENIZER)/remove-non-printing-char.perl |\
$(TOKENIZER)/normalize-punctuation.perl |\
sed 's/ */ /g;s/^ *//g;s/ *$$//g' > $@
%.zh_cn.tok: %.zh_cn.raw
$(LOAD_MOSES) cat $< |\
$(TOKENIZER)/replace-unicode-punctuation.perl |\
$(TOKENIZER)/remove-non-printing-char.perl |\
$(TOKENIZER)/normalize-punctuation.perl |\
sed 's/ */ /g;s/^ *//g;s/ *$$//g' > $@
%.zh.tok: %.zh.raw
$(LOAD_MOSES) cat $< |\
$(TOKENIZER)/replace-unicode-punctuation.perl |\
$(TOKENIZER)/remove-non-printing-char.perl |\
$(TOKENIZER)/normalize-punctuation.perl |\
sed 's/ */ /g;s/^ *//g;s/ *$$//g' > $@
## generic target for tokenization
%.tok: %.raw
$(LOAD_MOSES) cat $< |\
$(TOKENIZER)/replace-unicode-punctuation.perl |\
$(TOKENIZER)/remove-non-printing-char.perl |\
$(TOKENIZER)/normalize-punctuation.perl \
-l ${lastword ${subst 1,,${subst 2,,${subst ., ,$(<:.raw=)}}}} |\
$(TOKENIZER)/tokenizer.perl -a -threads $(THREADS) \
-l ${lastword ${subst 1,,${subst 2,,${subst ., ,$(<:.raw=)}}}} |\
sed 's/ */ /g;s/^ *//g;s/ *$$//g' > $@
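The nested `subst` calls recover the language code from the file name: strip the `.raw` suffix, split on dots, drop the `1`/`2` markers used for same-language pairs, and keep the last field. A bash equivalent with a hypothetical file name:

```shell
# Hypothetical file name for a same-language pair (SRCEXT = da1, TRGEXT = da2).
f='Tatoeba.da-da.da1.raw'
base=${f%.raw}        # strip the .raw suffix
lang=${base##*.}      # last dot-separated field -> "da1"
lang=${lang//[12]/}   # drop the numeric marker -> "da"
printf '%s\n' "$lang"
# prints "da"
```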
## only normalisation
%.norm: %.raw
$(LOAD_MOSES) cat $< |\
$(TOKENIZER)/replace-unicode-punctuation.perl |\
$(TOKENIZER)/remove-non-printing-char.perl |\
$(TOKENIZER)/normalize-punctuation.perl |\
sed 's/ */ /g;s/^ *//g;s/ *$$//g' > $@
%.norm.gz: %.gz
$(LOAD_MOSES) zcat $< |\
$(TOKENIZER)/replace-unicode-punctuation.perl |\
$(TOKENIZER)/remove-non-printing-char.perl |\
$(TOKENIZER)/normalize-punctuation.perl |\
sed 's/ */ /g;s/^ *//g;s/ *$$//g' | gzip -c > $@
## increase max number of tokens to 250
## (TODO: should MIN_NR_TOKENS be 1?)
MIN_NR_TOKENS = 0
MAX_NR_TOKENS = 250
## apply the cleanup script from Moses
%.src.clean.${PRE_SRC}: %.src.${PRE_SRC} %.trg.${PRE_TRG}
rm -f $<.${SRCEXT} $<.${TRGEXT}
ln -s ${word 1,$^} $<.${SRCEXT}
ln -s ${word 2,$^} $<.${TRGEXT}
$(MOSESSCRIPTS)/training/clean-corpus-n.perl $< $(SRCEXT) $(TRGEXT) $@ ${MIN_NR_TOKENS} ${MAX_NR_TOKENS}
rm -f $<.${SRCEXT} $<.${TRGEXT}
mv $@.${SRCEXT} $@
mv $@.${TRGEXT} $(@:.src.clean.${PRE_SRC}=.trg.clean.${PRE_TRG})
%.trg.clean.${PRE_TRG}: %.src.clean.${PRE_SRC}
@echo "done!"
# tokenize testsets
testsets/%.raw: testsets/%.gz
gzip -cd < $< > $@
testsets/%.${PRE}.gz: testsets/%.${PRE}
gzip -c < $< > $@
ALLTEST = $(patsubst %.gz,%.${PRE}.gz,${sort $(subst .${PRE},,${wildcard testsets/*/*.??.gz})})
tokenize-testsets prepare-testsets: ${ALLTEST}
##----------------------------------------------
## BPE
##----------------------------------------------
## source/target specific bpe
## - make sure to leave the language flags alone!
## - make sure that we do not delete the BPE code files
## if the BPE models already exist
## ---> do not create new ones and always keep the old ones
## ---> need to delete the old ones if we want to create new BPE models
BPESRCMODEL = ${TRAIN_SRC}.bpe${SRCBPESIZE:000=}k-model
BPETRGMODEL = ${TRAIN_TRG}.bpe${TRGBPESIZE:000=}k-model
.PRECIOUS: ${BPESRCMODEL} ${BPETRGMODEL}
.INTERMEDIATE: ${LOCAL_TRAIN_SRC} ${LOCAL_TRAIN_TRG}
${BPESRCMODEL}: ${WORKDIR}/%.bpe${SRCBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${BPESRCMODEL}),)
mkdir -p ${dir $@}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
python3 ${SNMTPATH}/learn_bpe.py -s $(SRCBPESIZE) < $< > $@
else
cut -f2- -d ' ' $< > $<.text
python3 ${SNMTPATH}/learn_bpe.py -s $(SRCBPESIZE) < $<.text > $@
rm -f $<.text
endif
else
@echo "$@ already exists!"
@echo "WARNING! No new BPE model is created even though the data has changed!"
@echo "WARNING! Delete the file if you want to start from scratch!"
touch $@
endif
## no labels on the target language side
${BPETRGMODEL}: ${WORKDIR}/%.bpe${TRGBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${BPETRGMODEL}),)
mkdir -p ${dir $@}
python3 ${SNMTPATH}/learn_bpe.py -s $(TRGBPESIZE) < $< > $@
else
@echo "$@ already exists!"
@echo "WARNING! No new BPE codes are created!"
@echo "WARNING! Delete the file if you want to start from scratch!"
touch $@
endif
%.src.bpe${SRCBPESIZE:000=}k: %.src ${BPESRCMODEL}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
python3 ${SNMTPATH}/apply_bpe.py -c $(word 2,$^) < $< > $@
else
cut -f1 -d ' ' $< > $<.labels
cut -f2- -d ' ' $< > $<.text
python3 ${SNMTPATH}/apply_bpe.py -c $(word 2,$^) < $<.text > $@.text
paste -d ' ' $<.labels $@.text > $@
rm -f $<.labels $<.text $@.text
endif
%.trg.bpe${TRGBPESIZE:000=}k: %.trg ${BPETRGMODEL}
python3 ${SNMTPATH}/apply_bpe.py -c $(word 2,$^) < $< > $@
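In the multi-target case the first space-separated field of each training line is a target-language token (e.g. `>>de<<`) that must not be segmented; the recipe above splits it off, segments the rest, and glues the label back on. A minimal shell sketch with a pass-through `cat` standing in for `apply_bpe.py`:

```shell
# Toy demonstration of the label split / segment / rejoin pattern.
# `cat` is only a stand-in for the real subword segmenter.
printf '>>de<< hello world\n' > input.src
cut -f1  -d ' ' input.src > input.src.labels   # >>de<<
cut -f2- -d ' ' input.src > input.src.text     # hello world
cat < input.src.text > output.text             # segmenter placeholder
paste -d ' ' input.src.labels output.text > output.src
cat output.src                                 # >>de<< hello world
```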
## this places @@ markers in front of punctuation marks
## if they appear to the right of a segment boundary
## (useful if we use BPE without tokenization)
%.segfix: %
perl -pe 's/(\P{P})\@\@ (\p{P})/$$1 \@\@$$2/g' < $< > $@
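The Perl one-liner above relies on Unicode punctuation classes (`\p{P}`/`\P{P}`); a rough ASCII-only sed illustration of the same rewrite, moving the `@@ ` continuation marker onto a following punctuation mark:

```shell
# ASCII approximation of the segfix rewrite (the real rule uses
# Unicode-aware Perl classes): a non-punctuation, non-space character
# followed by "@@ " and a punctuation mark gets the marker moved over.
echo 'wor@@ ld@@ .' |
sed 's/\([^[:punct:][:space:]]\)@@ \([[:punct:]]\)/\1 @@\2/g'
# wor@@ ld @@.
```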
%.trg.txt: %.trg
mkdir -p ${dir $@}
mv $< $@
%.src.txt: %.src
mkdir -p ${dir $@}
mv $< $@
##----------------------------------------------
## sentence piece
##----------------------------------------------
SPMSRCMODEL = ${TRAIN_SRC}.spm${SRCBPESIZE:000=}k-model
SPMTRGMODEL = ${TRAIN_TRG}.spm${TRGBPESIZE:000=}k-model
.PRECIOUS: ${SPMSRCMODEL} ${SPMTRGMODEL}
${SPMSRCMODEL}: ${WORKDIR}/%.spm${SRCBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${SPMSRCMODEL}),)
mkdir -p ${dir $@}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
grep . $< > $<.text
${SPM_HOME}/spm_train \
--model_prefix=$@ --vocab_size=$(SRCBPESIZE) --input=$<.text \
--character_coverage=1.0 --hard_vocab_limit=false
else
cut -f2- -d ' ' $< | grep . > $<.text
${SPM_HOME}/spm_train \
--model_prefix=$@ --vocab_size=$(SRCBPESIZE) --input=$<.text \
--character_coverage=1.0 --hard_vocab_limit=false
endif
mv $@.model $@
rm -f $<.text
else
@echo "$@ already exists!"
@echo "WARNING! No new SPM model is created even though the data has changed!"
@echo "WARNING! Delete the file if you want to start from scratch!"
touch $@
endif
## no labels on the target language side
${SPMTRGMODEL}: ${WORKDIR}/%.spm${TRGBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${SPMTRGMODEL}),)
mkdir -p ${dir $@}
grep . $< > $<.text
${SPM_HOME}/spm_train \
--model_prefix=$@ --vocab_size=$(TRGBPESIZE) --input=$<.text \
--character_coverage=1.0 --hard_vocab_limit=false
mv $@.model $@
rm -f $<.text
else
@echo "$@ already exists!"
@echo "WARNING! No new SPM model created!"
@echo "WARNING! Delete the file if you want to start from scratch!"
touch $@
endif
%.src.spm${SRCBPESIZE:000=}k: %.src ${SPMSRCMODEL}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
${SPM_HOME}/spm_encode --model $(word 2,$^) < $< > $@
else
cut -f1 -d ' ' $< > $<.labels
cut -f2- -d ' ' $< > $<.text
${SPM_HOME}/spm_encode --model $(word 2,$^) < $<.text > $@.text
paste -d ' ' $<.labels $@.text > $@
rm -f $<.labels $<.text $@.text
endif
%.trg.spm${TRGBPESIZE:000=}k: %.trg ${SPMTRGMODEL}
${SPM_HOME}/spm_encode --model $(word 2,$^) < $< > $@
## document-level models (with guided alignment)
%.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz:
${MAKE} PRE_SRC=spm${SRCBPESIZE:000=}k PRE_TRG=spm${TRGBPESIZE:000=}k wordalign
./large-context.pl -l ${CONTEXT_SIZE} \
${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,%.src.spm${SRCBPESIZE:000=}k.gz,$@} \
${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,%.trg.spm${TRGBPESIZE:000=}k.gz,$@} \
${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,%.spm${SRCBPESIZE:000=}k-spm${TRGBPESIZE:000=}k.src-trg.alg.gz,$@} \
| gzip > $@.tmp.gz
zcat $@.tmp.gz | cut -f1 | gzip -c > $@
zcat $@.tmp.gz | cut -f2 | gzip -c > ${subst .src.,.trg.,$@}
zcat $@.tmp.gz | cut -f3 | \
gzip -c > ${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,\
%.spm${SRCBPESIZE:000=}k.doc${CONTEXT_SIZE}-spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.src-trg.alg.gz,$@}
rm -f $@.tmp.gz
%.trg.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz: %.src.spm${SRCBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz
@echo "done!"
## for validation and test data:
%.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}:
${MAKE} PRE_SRC=spm${SRCBPESIZE:000=}k PRE_TRG=spm${TRGBPESIZE:000=}k devdata
${MAKE} PRE_SRC=spm${SRCBPESIZE:000=}k PRE_TRG=spm${TRGBPESIZE:000=}k testdata
./large-context.pl -l ${CONTEXT_SIZE} \
${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE},%.src.spm${SRCBPESIZE:000=}k,$@} \
${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE},%.trg.spm${TRGBPESIZE:000=}k,$@} \
| gzip > $@.tmp.gz
zcat $@.tmp.gz | cut -f1 > $@
zcat $@.tmp.gz | cut -f2 > ${subst .src.,.trg.,$@}
rm -f $@.tmp.gz
%.trg.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}: %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}
@echo "done!"
##----------------------------------------------
## get data from local space and compress ...
${WORKDIR}/%.clean.${PRE_SRC}.gz: ${TMPDIR}/${LANGSTR}/%.clean.${PRE_SRC}
mkdir -p ${dir $@}
gzip -c < $< > $@
ifneq (${PRE_SRC},${PRE_TRG})
${WORKDIR}/%.clean.${PRE_TRG}.gz: ${TMPDIR}/${LANGSTR}/%.clean.${PRE_TRG}
mkdir -p ${dir $@}
gzip -c < $< > $@
endif

Makefile.def (new file, 115 lines)

# -*-makefile-*-
# enable e-mail notification by setting EMAIL
WHOAMI = $(shell whoami)
ifeq ("$(WHOAMI)","tiedeman")
EMAIL = jorg.tiedemann@helsinki.fi
endif
# job-specific settings (overwrite if necessary)
# HPC_EXTRA: additional SBATCH commands
CPU_MODULES = gcc/6.2.0 mkl
GPU_MODULES = cuda-env/8 mkl
# GPU_MODULES = python-env/3.5.3-ml cuda-env/8 mkl
# GPU = k80
GPU = p100
NR_GPUS = 1
HPC_MEM = 8g
HPC_NODES = 1
HPC_CORES = 1
HPC_DISK = 500
HPC_QUEUE = serial
# HPC_MODULES = nlpl-opus python-env/3.4.1 efmaral moses
# HPC_MODULES = nlpl-opus moses cuda-env marian python-3.5.3-ml
HPC_MODULES = ${GPU_MODULES}
HPC_EXTRA =
WALLTIME = 72
DEVICE = cuda
LOADCPU = module load ${CPU_MODULES}
LOADGPU = module load ${GPU_MODULES}
MARIAN_WORKSPACE = 13000
ifeq (${shell hostname},dx6-ibs-p2)
APPLHOME = /opt/tools
WORKHOME = ${shell realpath ${PWD}/work}
OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
MOSESHOME = ${APPLHOME}/mosesdecoder
MARIAN = ${APPLHOME}/marian/build
LOADMODS = echo "nothing to load"
MARIAN_WORKSPACE = 10000
else ifeq (${shell hostname},dx7-nkiel-4gpu)
APPLHOME = /opt/tools
WORKHOME = ${shell realpath ${PWD}/work}
OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
MOSESHOME = ${APPLHOME}/mosesdecoder
MARIAN = ${APPLHOME}/marian/build
LOADMODS = echo "nothing to load"
MARIAN_WORKSPACE = 10000
else ifneq ($(wildcard /wrk/tiedeman/research/),)
DATAHOME = /proj/OPUS/WMT19/data/${LANGPAIR}
# APPLHOME = ${USERAPPL}/tools
APPLHOME = /proj/memad/tools
WORKHOME = /wrk/tiedeman/research/marian/${SRC}-${TRG}
OPUSHOME = /proj/nlpl/data/OPUS
MOSESHOME = /proj/nlpl/software/moses/4.0-65c75ff/moses
# MARIAN = /proj/nlpl/software/marian/1.2.0
# MARIAN = /appl/ling/marian
MARIAN = ${HOME}/appl_taito/tools/marian/build-gpu
MARIANCPU = ${HOME}/appl_taito/tools/marian/build-cpu
LOADMODS = ${LOADGPU}
else
# CSCPROJECT = project_2001194
CSCPROJECT = project_2000309
DATAHOME = ${HOME}/work/opentrans/data/${LANGPAIR}
WORKHOME = ${shell realpath ${PWD}/work}
APPLHOME = ${HOME}/projappl
OPUSHOME = /scratch/project_2000661/nlpl/data/OPUS
MOSESHOME = ${APPLHOME}/mosesdecoder
EFLOMAL_HOME = ${APPLHOME}/eflomal/
MARIAN = ${APPLHOME}/marian/build
MARIANCPU = ${APPLHOME}/marian/build
# GPU_MODULES = cuda intel-mkl
GPU = v100
GPU_MODULES = python-env
CPU_MODULES = python-env
LOADMODS = echo "nothing to load"
HPC_QUEUE = small
MARIAN_WORKSPACE = 30000
endif
ifdef LOCAL_SCRATCH
TMPDIR = ${LOCAL_SCRATCH}
endif
WORDALIGN = ${EFLOMAL_HOME}align.py
ATOOLS = ${FASTALIGN_HOME}atools
MULTEVALHOME = ${APPLHOME}/multeval
MOSESSCRIPTS = ${MOSESHOME}/scripts
TOKENIZER = ${MOSESSCRIPTS}/tokenizer
SNMTPATH = ${APPLHOME}/subword-nmt/subword_nmt
# sorted languages and langpair used to match resources in OPUS
SORTLANGS = $(sort ${SRC} ${TRG})
LANGPAIR = ${firstword ${SORTLANGS}}-${lastword ${SORTLANGS}}
## for identical language pairs: add a numeric extension
ifeq (${SRC},$(TRG))
SRCEXT = ${SRC}1
TRGEXT = ${SRC}2
else
SRCEXT = ${SRC}
TRGEXT = ${TRG}
endif

Makefile.dist (new file, 351 lines)

# -*-makefile-*-
#
# make distribution packages
# and upload them to cPouta ObjectStorage
#
MODELSHOME = ${WORKHOME}/models
DIST_PACKAGE = ${MODELSHOME}/${LANGSTR}/${DATASET}.zip
## minimum BLEU score for models to be accepted as distribution package
MIN_BLEU_SCORE = 20
.PHONY: dist
dist: ${DIST_PACKAGE}
.PHONY: scores
scores: ${WORKHOME}/eval/scores.txt
## get the best model from all kinds of alternative setups
## in the following sub directories (add prefix work-)
# ALT_MODEL_DIR = bpe-old bpe-memad bpe spm-noalign bpe-align spm
ALT_MODEL_DIR = spm
best_dist_all:
for l in $(sort ${shell ls work* | grep -- '-' | grep -v old | grep -v work}); do \
if [ `find work*/$$l -name '${DATASET}${TRAINSIZE}.*.npz' | wc -l` -gt 0 ]; then \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" best_dist; \
fi \
done
## find the best model according to test set scores
## and make a distribution package from that model
## (BLEU needs to be above MIN_BLEU_SCORE)
## NEW: don't trust models tested with GNOME test sets!
best_dist:
@m=0;\
s=''; \
echo "------------------------------------------------"; \
echo "search best model for ${LANGSTR}"; \
for d in ${ALT_MODEL_DIR}; do \
e=`ls work-$$d/${LANGSTR}/val/*.trg | xargs basename | sed 's/\.trg//'`; \
echo "evaldata = $$e"; \
if [ "$$e" != "GNOME" ]; then \
if ls work-$$d/${LANGSTR}/$$e*.eval 1> /dev/null 2>&1; then \
b=`grep 'BLEU+' work-$$d/${LANGSTR}/$$e*.eval | cut -f3 -d' '`; \
if (( $$(echo "$$m-$$b < 1" |bc -l) )); then \
echo "$$d ($$b) is better or not much worse than $$s ($$m)!"; \
m=$$b; \
s=$$d; \
else \
echo "$$d ($$b) is worse than $$s ($$m)!"; \
fi \
fi \
fi \
done; \
echo "------------------------------------------------"; \
if [ "$$s" != "" ]; then \
if (( $$(echo "$$m > ${MIN_BLEU_SCORE}" |bc -l) )); then \
${MAKE} MODELSHOME=${PWD}/models \
MODELS_URL=https://object.pouta.csc.fi/OPUS-MT-models dist-$$s; \
fi; \
fi
## make a package for distribution
## old: only accept models with a certain evaluation score:
# if [ `grep BLEU $(TEST_EVALUATION) | cut -f3 -d ' ' | cut -f1 -d '.'` -ge ${MIN_BLEU_SCORE} ]; then \
DATE = ${shell date +%F}
MODELS_URL = https://object.pouta.csc.fi/OPUS-MT-dev
SKIP_DIST_EVAL = 0
${DIST_PACKAGE}: ${MODEL_FINAL}
ifneq (${SKIP_DIST_EVAL},1)
@${MAKE} $(TEST_EVALUATION)
@${MAKE} $(TEST_COMPARISON)
endif
@mkdir -p ${dir $@}
@touch ${WORKDIR}/source.tcmodel
@echo "# $(notdir ${@:.zip=})-${DATE}.zip" > ${WORKDIR}/README.md
@echo '' >> ${WORKDIR}/README.md
@echo "* dataset: ${DATASET}" >> ${WORKDIR}/README.md
@echo "* model: ${MODELTYPE}" >> ${WORKDIR}/README.md
@if [ -e ${BPESRCMODEL} ]; then \
echo "* pre-processing: normalization + tokenization + BPE" >> ${WORKDIR}/README.md; \
cp ${BPESRCMODEL} ${WORKDIR}/source.bpe; \
cp ${BPETRGMODEL} ${WORKDIR}/target.bpe; \
cp preprocess-bpe.sh ${WORKDIR}/preprocess.sh; \
cp postprocess-bpe.sh ${WORKDIR}/postprocess.sh; \
elif [ -e ${SPMSRCMODEL} ]; then \
echo "* pre-processing: normalization + SentencePiece" >> ${WORKDIR}/README.md; \
cp ${SPMSRCMODEL} ${WORKDIR}/source.spm; \
cp ${SPMTRGMODEL} ${WORKDIR}/target.spm; \
cp preprocess-spm.sh ${WORKDIR}/preprocess.sh; \
cp postprocess-spm.sh ${WORKDIR}/postprocess.sh; \
fi
@if [ ${words ${TRGLANGS}} -gt 1 ]; then \
echo '* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)' \
>> ${WORKDIR}/README.md; \
fi
@echo "* download: [$(notdir ${@:.zip=})-${DATE}.zip](${MODELS_URL}/${LANGSTR}/$(notdir ${@:.zip=})-${DATE}.zip)" >> ${WORKDIR}/README.md
if [ -e $(TEST_EVALUATION) ]; then \
echo "* test set translations: [$(notdir ${@:.zip=})-${DATE}.test.txt](${MODELS_URL}/${LANGSTR}/$(notdir ${@:.zip=})-${DATE}.test.txt)" >> ${WORKDIR}/README.md; \
echo "* test set scores: [$(notdir ${@:.zip=})-${DATE}.eval.txt](${MODELS_URL}/${LANGSTR}/$(notdir ${@:.zip=})-${DATE}.eval.txt)" >> ${WORKDIR}/README.md; \
echo '' >> ${WORKDIR}/README.md; \
echo '## Benchmarks' >> ${WORKDIR}/README.md; \
echo '' >> ${WORKDIR}/README.md; \
cd ${WORKDIR}; \
grep -H BLEU *k${NR}.*eval | \
tr '.' '/' | cut -f1,5,6 -d '/' | tr '/' "." > $@.1; \
grep BLEU *k${NR}.*eval | cut -f3 -d ' ' > $@.2; \
grep chrF *k${NR}.*eval | cut -f3 -d ' ' > $@.3; \
echo '| testset | BLEU | chr-F |' >> README.md; \
echo '|-----------------------|-------|-------|' >> README.md; \
paste $@.1 $@.2 $@.3 | sed "s/\t/ | /g;s/^/| /;s/$$/ |/" >> README.md; \
rm -f $@.1 $@.2 $@.3; \
fi
@cat ${WORKDIR}/README.md >> ${dir $@}README.md
@echo '' >> ${dir $@}README.md
@cp LICENSE ${WORKDIR}/
@chmod +x ${WORKDIR}/preprocess.sh
@sed -e 's# - /.*/\([^/]*\)$$# - \1#' \
-e 's/beam-size: [0-9]*$$/beam-size: 6/' \
-e 's/mini-batch: [0-9]*$$/mini-batch: 1/' \
-e 's/maxi-batch: [0-9]*$$/maxi-batch: 1/' \
-e 's/relative-paths: false/relative-paths: true/' \
< ${MODEL_DECODER} > ${WORKDIR}/decoder.yml
@cd ${WORKDIR} && zip ${notdir $@} \
README.md LICENSE \
${notdir ${MODEL_FINAL}} \
${notdir ${MODEL_VOCAB}} \
${notdir ${MODEL_VALIDLOG}} \
${notdir ${MODEL_TRAINLOG}} \
source.* target.* decoder.yml preprocess.sh postprocess.sh
@mkdir -p ${dir $@}
@mv -f ${WORKDIR}/${notdir $@} ${@:.zip=}-${DATE}.zip
if [ -e $(TEST_EVALUATION) ]; then \
cp $(TEST_EVALUATION) ${@:.zip=}-${DATE}.eval.txt; \
cp $(TEST_COMPARISON) ${@:.zip=}-${DATE}.test.txt; \
fi
@rm -f $@
@cd ${dir $@} && ln -s $(notdir ${@:.zip=})-${DATE}.zip ${notdir $@}
@rm -f ${WORKDIR}/decoder.yml ${WORKDIR}/source.* ${WORKDIR}/target.*
@rm -f ${WORKDIR}/preprocess.sh ${WORKDIR}/postprocess.sh
EVALSCORES = ${patsubst ${WORKHOME}/%.eval,${WORKHOME}/eval/%.eval.txt,${wildcard ${WORKHOME}/*/*.eval}}
EVALTRANSL = ${patsubst ${WORKHOME}/%.compare,${WORKHOME}/eval/%.test.txt,${wildcard ${WORKHOME}/*/*.compare}}
## upload to Object Storage
## Don't forget to run this before uploading!
# source project_2000661-openrc.sh
#
# - make upload ......... released models = all sub-dirs in models/
# - make upload-models .. trained models in current WORKHOME to OPUS-MT-dev
# - make upload-scores .. score file with benchmark results to OPUS-MT-eval
# - make upload-eval .... benchmark tests from models in WORKHOME
# - make upload-images .. images of VMs that run OPUS-MT
upload:
cd models && swift upload OPUS-MT-models --changed --skip-identical *
swift post OPUS-MT-models --read-acl ".r:*"
swift list OPUS-MT-models > index.txt
swift upload OPUS-MT-models index.txt
rm -f index.txt
upload-models:
cd ${WORKHOME} && swift upload OPUS-MT-dev --changed --skip-identical models
swift post OPUS-MT-dev --read-acl ".r:*"
swift list OPUS-MT-dev > index.txt
swift upload OPUS-MT-dev index.txt
rm -f index.txt
upload-scores: ${WORKHOME}/eval/scores.txt
cd ${WORKHOME} && swift upload OPUS-MT-eval --changed --skip-identical eval/scores.txt
swift post OPUS-MT-eval --read-acl ".r:*"
upload-eval: ${EVALSCORES} ${EVALTRANSL}
cd ${WORKHOME} && swift upload OPUS-MT-eval --changed --skip-identical eval
swift post OPUS-MT-eval --read-acl ".r:*"
upload-images:
cd ${WORKHOME} && swift upload OPUS-MT --changed --skip-identical \
--use-slo --segment-size 5G opusMT-images
swift post OPUS-MT-images --read-acl ".r:*"
## this is for the multeval scores
# ${WORKHOME}/eval/scores.txt: ${EVALSCORES}
# cd ${WORKHOME} && \
# grep base */*eval | cut -f1,2- -d '/' | cut -f1,6- -d '.' | \
# sed 's/-/ /' | sed 's/\// /' | sed 's/ ([^)]*)//g' |\
# sed 's/.eval:baseline//' | sed "s/ */\t/g" | sort > $@
${WORKHOME}/eval/scores.txt: ${EVALSCORES}
cd ${WORKHOME} && grep BLEU */*k${NR}.*eval | cut -f1 -d '/' | tr '-' "\t" > $@.1
cd ${WORKHOME} && grep BLEU */*k${NR}.*eval | tr '.' '/' | cut -f2,6,7 -d '/' | tr '/' "." > $@.2
cd ${WORKHOME} && grep BLEU */*k${NR}.*eval | cut -f3 -d ' ' > $@.3
cd ${WORKHOME} && grep chrF */*k${NR}.*eval | cut -f3 -d ' ' > $@.4
paste $@.1 $@.2 $@.3 $@.4 > $@
rm -f $@.1 $@.2 $@.3 $@.4
${EVALSCORES}: # ${WORKHOME}/eval/%.eval.txt: ${WORKHOME}/models/%.eval
mkdir -p ${dir $@}
cp ${patsubst ${WORKHOME}/eval/%.eval.txt,${WORKHOME}/%.eval,$@} $@
# cp $< $@
${EVALTRANSL}: # ${WORKHOME}/eval/%.test.txt: ${WORKHOME}/models/%.compare
mkdir -p ${dir $@}
cp ${patsubst ${WORKHOME}/eval/%.test.txt,${WORKHOME}/%.compare,$@} $@
# cp $< $@
# ## dangerous area ....
# delete-eval:
# swift delete OPUS-MT eval
######################################################################
## handle old models in previous work directories
## obsolete now?
######################################################################
##-----------------------------------
## make packages from trained models
## check old-models as well!
TRAINED_NEW_MODELS = ${patsubst ${WORKHOME}/%/,%,${dir ${wildcard ${WORKHOME}/*/*.best-perplexity.npz}}}
# TRAINED_OLD_MODELS = ${patsubst ${WORKHOME}/old-models/%/,%,${dir ${wildcard ${WORKHOME}/old-models/*/*.best-perplexity.npz}}}
TRAINED_OLD_MODELS = ${patsubst ${WORKHOME}/old-models/%/,%,${dir ${wildcard ${WORKHOME}/old-models/??-??/*.best-perplexity.npz}}}
TRAINED_OLD_ONLY_MODELS = ${filter-out ${TRAINED_NEW_MODELS},${TRAINED_OLD_MODELS}}
TRAINED_NEW_ONLY_MODELS = ${filter-out ${TRAINED_OLD_MODELS},${TRAINED_NEW_MODELS}}
TRAINED_DOUBLE_MODELS = ${filter ${TRAINED_NEW_MODELS},${TRAINED_OLD_MODELS}}
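The variables above use GNU make's `filter`/`filter-out` as set intersection and set difference over the model lists. A quick stand-alone check of those semantics (toy language-pair lists; `.RECIPEPREFIX`, GNU make >= 3.82, avoids literal tabs in this sketch):

```shell
# Demonstrate filter / filter-out on hypothetical model lists.
cat > demo.mk <<'EOF'
.RECIPEPREFIX := >
NEW = de-en fi-en sv-en
OLD = de-en nl-en
all:
>@echo "only-new: $(filter-out $(OLD),$(NEW))"
>@echo "both: $(filter $(NEW),$(OLD))"
EOF
make -s -f demo.mk
# only-new: fi-en sv-en
# both: de-en
```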
## make packages of all new models
## unless there are better models in old-models
new-models-dist:
@echo "nr of extra models: ${words ${TRAINED_NEW_ONLY_MODELS}}"
for l in ${TRAINED_NEW_ONLY_MODELS}; do \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" dist; \
done
@echo "trained double ${words ${TRAINED_DOUBLE_MODELS}}"
for l in ${TRAINED_DOUBLE_MODELS}; do \
n=`grep 'new best' work/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
o=`grep 'new best' work/old-models/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
if (( $$(echo "$$n < $$o" |bc -l) )); then \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" dist; \
fi \
done
## fix decoder path in old-models (to run evaluations)
fix-decoder-path:
for l in ${wildcard ${WORKHOME}/old-models/*/*.best-perplexity.npz.decoder.yml}; do \
sed --in-place=.backup 's#/\(..-..\)/opus#/old-models/\1/opus#' $$l; \
sed --in-place=.backup2 's#/old-models/old-models/#/old-models/#' $$l; \
done
## make packages of all old models from old-models
## unless there are better models in work (new models)
old-models-dist:
@echo "nr of extra models: ${words ${TRAINED_OLD_ONLY_MODELS}}"
for l in ${TRAINED_OLD_ONLY_MODELS}; do \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" \
WORKHOME=${WORKHOME}/old-models \
MODELSHOME=${WORKHOME}/models dist; \
done
@echo "trained double ${words ${TRAINED_DOUBLE_MODELS}}"
for l in ${TRAINED_DOUBLE_MODELS}; do \
n=`grep 'new best' work/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
o=`grep 'new best' work/old-models/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
if (( $$(echo "$$o < $$n" |bc -l) )); then \
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" \
WORKHOME=${WORKHOME}/old-models \
MODELSHOME=${WORKHOME}/models dist; \
else \
echo "$$l: new better than old"; \
fi \
done
## old models had slightly different naming conventions
LASTSRC = ${lastword ${SRCLANGS}}
LASTTRG = ${lastword ${TRGLANGS}}
MODEL_OLD = ${MODEL_SUBDIR}${DATASET}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}.${LASTSRC}${LASTTRG}
MODEL_OLD_BASENAME = ${MODEL_OLD}.${MODELTYPE}.model${NR}
MODEL_OLD_FINAL = ${WORKDIR}/${MODEL_OLD_BASENAME}.npz.best-perplexity.npz
MODEL_OLD_VOCAB = ${WORKDIR}/${MODEL_OLD}.vocab.${MODEL_VOCABTYPE}
MODEL_OLD_DECODER = ${MODEL_OLD_FINAL}.decoder.yml
MODEL_TRANSLATE = ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}
MODEL_OLD_TRANSLATE = ${WORKDIR}/${TESTSET}.${MODEL_OLD}${NR}.${MODELTYPE}.${SRC}.${TRG}
MODEL_OLD_VALIDLOG = ${MODEL_OLD}.${MODELTYPE}.valid${NR}.log
MODEL_OLD_TRAINLOG = ${MODEL_OLD}.${MODELTYPE}.train${NR}.log
link-old-models:
if [ ! -e ${MODEL_FINAL} ]; then \
if [ -e ${MODEL_OLD_FINAL} ]; then \
ln -s ${MODEL_OLD_FINAL} ${MODEL_FINAL}; \
ln -s ${MODEL_OLD_VOCAB} ${MODEL_VOCAB}; \
ln -s ${MODEL_OLD_DECODER} ${MODEL_DECODER}; \
fi \
fi
if [ ! -e ${MODEL_TRANSLATE} ]; then \
if [ -e ${MODEL_OLD_TRANSLATE} ]; then \
ln -s ${MODEL_OLD_TRANSLATE} ${MODEL_TRANSLATE}; \
fi \
fi
if [ ! -e ${WORKDIR}/${MODEL_VALIDLOG} ]; then \
if [ -e ${WORKDIR}/${MODEL_OLD_VALIDLOG} ]; then \
ln -s ${WORKDIR}/${MODEL_OLD_VALIDLOG} ${WORKDIR}/${MODEL_VALIDLOG}; \
ln -s ${WORKDIR}/${MODEL_OLD_TRAINLOG} ${WORKDIR}/${MODEL_TRAINLOG}; \
fi \
fi
rm -f ${MODEL_TRANSLATE}.eval
rm -f ${MODEL_TRANSLATE}.compare

Makefile.doclevel (new file, 60 lines)

# -*-makefile-*-
DOCLEVEL_BENCHMARK_DATA = https://zenodo.org/record/3525366/files/doclevel-MT-benchmark-discomt2019.zip
## use the doclevel benchmark data sets
%-ost:
${MAKE} ost-datasets
${MAKE} SRCLANGS=en TRGLANGS=de \
TRAINSET=ost-train \
DEVSET=ost-dev \
TESTSET=ost-test \
DEVSIZE=100000 TESTSIZE=100000 HELDOUTSIZE=0 \
${@:-ost=}
ost-datasets: ${DATADIR}/${PRE}/ost-train.de-en.clean.de.gz \
${DATADIR}/${PRE}/ost-train.de-en.clean.en.gz \
${DATADIR}/${PRE}/ost-dev.de-en.clean.de.gz \
${DATADIR}/${PRE}/ost-dev.de-en.clean.en.gz \
${DATADIR}/${PRE}/ost-test.de-en.clean.de.gz \
${DATADIR}/${PRE}/ost-test.de-en.clean.en.gz
.INTERMEDIATE: ${WORKHOME}/doclevel-MT-benchmark
## download the doc-level data set
${WORKHOME}/doclevel-MT-benchmark:
wget -O $@.zip ${DOCLEVEL_BENCHMARK_DATA}?download=1
unzip -d ${dir $@} $@.zip
rm -f $@.zip
${DATADIR}/${PRE}/ost-train.de-en.clean.de.gz: ${WORKHOME}/doclevel-MT-benchmark
mkdir -p ${dir $@}
$(TOKENIZER)/detokenizer.perl -l de < $</train/ost.tok.de | gzip -c > $@
${DATADIR}/${PRE}/ost-train.de-en.clean.en.gz: ${WORKHOME}/doclevel-MT-benchmark
mkdir -p ${dir $@}
$(TOKENIZER)/detokenizer.perl -l en < $</train/ost.tok.en | gzip -c > $@
${DATADIR}/${PRE}/ost-dev.de-en.clean.de.gz: ${WORKHOME}/doclevel-MT-benchmark
mkdir -p ${dir $@}
$(TOKENIZER)/detokenizer.perl -l de < $</dev/ost.tok.de | gzip -c > $@
${DATADIR}/${PRE}/ost-dev.de-en.clean.en.gz: ${WORKHOME}/doclevel-MT-benchmark
mkdir -p ${dir $@}
$(TOKENIZER)/detokenizer.perl -l en < $</dev/ost.tok.en | gzip -c > $@
${DATADIR}/${PRE}/ost-test.de-en.clean.de.gz: ${WORKHOME}/doclevel-MT-benchmark
mkdir -p ${dir $@}
$(TOKENIZER)/detokenizer.perl -l de < $</test/ost.tok.de | gzip -c > $@
${DATADIR}/${PRE}/ost-test.de-en.clean.en.gz: ${WORKHOME}/doclevel-MT-benchmark
mkdir -p ${dir $@}
$(TOKENIZER)/detokenizer.perl -l en < $</test/ost.tok.en | gzip -c > $@

Makefile.env (new file, 126 lines)

# -*-makefile-*-
#
# settings of the environment
# - essential tools and their paths
# - system-specific settings
#
## modules to be loaded in sbatch scripts
CPU_MODULES = gcc/6.2.0 mkl
GPU_MODULES = cuda-env/8 mkl
# GPU_MODULES = python-env/3.5.3-ml cuda-env/8 mkl
# job-specific settings (overwrite if necessary)
# HPC_EXTRA: additional SBATCH commands
NR_GPUS = 1
HPC_NODES = 1
HPC_DISK = 500
HPC_QUEUE = serial
# HPC_MODULES = nlpl-opus python-env/3.4.1 efmaral moses
# HPC_MODULES = nlpl-opus moses cuda-env marian python-3.5.3-ml
HPC_MODULES = ${GPU_MODULES}
HPC_EXTRA =
MEM = 4g
THREADS = 1
WALLTIME = 72
## set variables with HPC prefix
ifndef HPC_TIME
HPC_TIME = ${WALLTIME}:00
endif
ifndef HPC_CORES
HPC_CORES = ${THREADS}
endif
ifndef HPC_MEM
HPC_MEM = ${MEM}
endif
# GPU = k80
GPU = p100
DEVICE = cuda
LOADCPU = module load ${CPU_MODULES}
LOADGPU = module load ${GPU_MODULES}
ifeq (${shell hostname},dx6-ibs-p2)
APPLHOME = /opt/tools
WORKHOME = ${shell realpath ${PWD}/work-spm}
OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
MOSESHOME = ${APPLHOME}/mosesdecoder
MARIAN = ${APPLHOME}/marian/build
LOADMODS = echo "nothing to load"
else ifeq (${shell hostname},dx7-nkiel-4gpu)
APPLHOME = /opt/tools
WORKHOME = ${shell realpath ${PWD}/work-spm}
OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
MOSESHOME = ${APPLHOME}/mosesdecoder
MARIAN = ${APPLHOME}/marian/build
LOADMODS = echo "nothing to load"
else ifneq ($(wildcard /wrk/tiedeman/research/),)
DATAHOME = /proj/OPUS/WMT19/data/${LANGPAIR}
# APPLHOME = ${USERAPPL}/tools
APPLHOME = /proj/memad/tools
WORKHOME = /wrk/tiedeman/research/Opus-MT/work-spm
OPUSHOME = /proj/nlpl/data/OPUS
MOSESHOME = /proj/nlpl/software/moses/4.0-65c75ff/moses
# MARIAN = /proj/nlpl/software/marian/1.2.0
# MARIAN = /appl/ling/marian
MARIAN = ${HOME}/appl_taito/tools/marian/build-gpu
MARIANCPU = ${HOME}/appl_taito/tools/marian/build-cpu
LOADMODS = ${LOADGPU}
else
CSCPROJECT = project_2001194
# CSCPROJECT = project_2000309
DATAHOME = ${HOME}/work/opentrans/data/${LANGPAIR}
WORKHOME = ${shell realpath ${PWD}/work-spm}
APPLHOME = ${HOME}/projappl
# OPUSHOME = /scratch/project_2000661/nlpl/data/OPUS
OPUSHOME = /projappl/nlpl/data/OPUS
MOSESHOME = ${APPLHOME}/mosesdecoder
EFLOMAL_HOME = ${APPLHOME}/eflomal/
# MARIAN = ${APPLHOME}/marian/build
# MARIANCPU = ${APPLHOME}/marian/build
MARIAN = ${APPLHOME}/marian-dev/build-spm
MARIANCPU = ${APPLHOME}/marian-dev/build-cpu
MARIANSPM = ${APPLHOME}/marian-dev/build-spm
# GPU_MODULES = cuda intel-mkl
GPU = v100
GPU_MODULES = python-env
# gcc/8.3.0 boost/1.68.0-mpi intel-mkl
CPU_MODULES = python-env
LOADMODS = echo "nothing to load"
HPC_QUEUE = small
endif
ifdef LOCAL_SCRATCH
TMPDIR = ${LOCAL_SCRATCH}
endif
## other tools and their locations
WORDALIGN = ${EFLOMAL_HOME}align.py
ATOOLS = ${FASTALIGN_HOME}atools
MULTEVALHOME = ${APPLHOME}/multeval
MOSESSCRIPTS = ${MOSESHOME}/scripts
TOKENIZER = ${MOSESSCRIPTS}/tokenizer
SNMTPATH = ${APPLHOME}/subword-nmt/subword_nmt
## SentencePiece
SPM_HOME = ${MARIANSPM}

Makefile.slurm (new file, 99 lines)

# -*-makefile-*-
# enable e-mail notification by setting EMAIL
WHOAMI = $(shell whoami)
ifeq ("$(WHOAMI)","tiedeman")
EMAIL = jorg.tiedemann@helsinki.fi
endif
##---------------------------------------------
## submit jobs
##---------------------------------------------
## submit job to gpu queue
%.submit:
mkdir -p ${WORKDIR}
echo '#!/bin/bash -l' > $@
echo '#SBATCH -J "${DATASET}-${@:.submit=}"' >>$@
echo '#SBATCH -o ${DATASET}-${@:.submit=}.out.%j' >> $@
echo '#SBATCH -e ${DATASET}-${@:.submit=}.err.%j' >> $@
echo '#SBATCH --mem=${HPC_MEM}' >> $@
# echo '#SBATCH --exclude=r18g05' >> $@
ifdef EMAIL
echo '#SBATCH --mail-type=END' >> $@
echo '#SBATCH --mail-user=${EMAIL}' >> $@
endif
echo '#SBATCH -n 1' >> $@
echo '#SBATCH -N 1' >> $@
echo '#SBATCH -p gpu' >> $@
ifeq (${shell hostname --domain},bullx)
echo '#SBATCH --account=${CSCPROJECT}' >> $@
echo '#SBATCH --gres=gpu:${GPU}:${NR_GPUS},nvme:${HPC_DISK}' >> $@
else
echo '#SBATCH --gres=gpu:${GPU}:${NR_GPUS}' >> $@
endif
echo '#SBATCH -t ${HPC_TIME}:00' >> $@
echo 'module use -a /proj/nlpl/modules' >> $@
for m in ${GPU_MODULES}; do \
echo "module load $$m" >> $@; \
done
echo 'module list' >> $@
echo 'cd $${SLURM_SUBMIT_DIR:-.}' >> $@
echo 'pwd' >> $@
echo 'echo "Starting at `date`"' >> $@
echo 'srun ${MAKE} ${MAKEARGS} ${@:.submit=}' >> $@
echo 'echo "Finishing at `date`"' >> $@
sbatch $@
mkdir -p ${WORKDIR}
mv $@ ${WORKDIR}/$@
# echo 'srun ${MAKE} NR=${NR} MODELTYPE=${MODELTYPE} DATASET=${DATASET} SRC=${SRC} TRG=${TRG} PRE_SRC=${PRE_SRC} PRE_TRG=${PRE_TRG} ${MAKEARGS} ${@:.submit=}' >> $@
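With the default settings from this file (GPU=p100, NR_GPUS=1, HPC_MEM=8g, WALLTIME=72) and a hypothetical `train.submit` invocation with DATASET=opus, the script assembled by the recipe above looks roughly like:

```shell
#!/bin/bash -l
#SBATCH -J "opus-train"
#SBATCH -o opus-train.out.%j
#SBATCH -e opus-train.err.%j
#SBATCH --mem=8g
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --gres=gpu:p100:1
#SBATCH -t 72:00:00
module use -a /proj/nlpl/modules
module load cuda-env/8
module load mkl
module list
cd ${SLURM_SUBMIT_DIR:-.}
pwd
echo "Starting at `date`"
srun make train
echo "Finishing at `date`"
```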
## submit job to cpu queue
%.submitcpu:
mkdir -p ${WORKDIR}
echo '#!/bin/bash -l' > $@
echo '#SBATCH -J "${@:.submitcpu=}"' >>$@
echo '#SBATCH -o ${@:.submitcpu=}.out.%j' >> $@
echo '#SBATCH -e ${@:.submitcpu=}.err.%j' >> $@
echo '#SBATCH --mem=${HPC_MEM}' >> $@
ifdef EMAIL
echo '#SBATCH --mail-type=END' >> $@
echo '#SBATCH --mail-user=${EMAIL}' >> $@
endif
ifeq (${shell hostname --domain},bullx)
echo '#SBATCH --account=${CSCPROJECT}' >> $@
echo '#SBATCH --gres=nvme:${HPC_DISK}' >> $@
# echo '#SBATCH --exclude=r05c49' >> $@
# echo '#SBATCH --exclude=r07c51' >> $@
# echo '#SBATCH --exclude=r06c50' >> $@
endif
echo '#SBATCH -n ${HPC_CORES}' >> $@
echo '#SBATCH -N ${HPC_NODES}' >> $@
echo '#SBATCH -p ${HPC_QUEUE}' >> $@
echo '#SBATCH -t ${HPC_TIME}:00' >> $@
echo '${HPC_EXTRA}' >> $@
echo 'module use -a /proj/nlpl/modules' >> $@
for m in ${CPU_MODULES}; do \
echo "module load $$m" >> $@; \
done
echo 'module list' >> $@
echo 'cd $${SLURM_SUBMIT_DIR:-.}' >> $@
echo 'pwd' >> $@
echo 'echo "Starting at `date`"' >> $@
echo '${MAKE} -j ${HPC_CORES} ${MAKEARGS} ${@:.submitcpu=}' >> $@
echo 'echo "Finishing at `date`"' >> $@
sbatch $@
mkdir -p ${WORKDIR}
mv $@ ${WORKDIR}/$@
# echo '${MAKE} -j ${HPC_CORES} DATASET=${DATASET} SRC=${SRC} TRG=${TRG} PRE_SRC=${PRE_SRC} PRE_TRG=${PRE_TRG} ${MAKEARGS} ${@:.submitcpu=}' >> $@

Makefile.tasks (new file, 485 lines)

# -*-makefile-*-
#
# pre-defined tasks that we might want to run
#
MEMAD_LANGS = de en fi fr nl sv
# GERMANIC = en de nl fy af da fo is no nb nn sv
GERMANIC = de nl fy af da fo is no nb nn sv
WESTGERMANIC = de nl af fy
SCANDINAVIAN = da fo is no nb nn sv
ROMANCE = ca es fr gl it la oc pt_br pt ro
FINNO_UGRIC = fi et hu
PIVOT = en
ifndef LANGS
LANGS = ${MEMAD_LANGS}
endif
## run things with individual data sets only
%-fiskmo:
${MAKE} TRAINSET=fiskmo ${@:-fiskmo=}
%-opensubtitles:
${MAKE} TRAINSET=OpenSubtitles ${@:-opensubtitles=}
%-finlex:
${MAKE} TRAINSET=Finlex ${@:-finlex=}
## a batch of interesting models ....
## germanic to germanic
germanic:
${MAKE} LANGS="${GERMANIC}" HPC_DISK=1500 multilingual
scandinavian:
${MAKE} LANGS="${SCANDINAVIAN}" multilingual-medium
memad2en:
${MAKE} LANGS="${MEMAD_LANGS}" PIVOT=en all2pivot
fiet:
${MAKE} SRCLANGS=fi TRGLANGS=et bilingual-medium
icelandic:
${MAKE} SRCLANGS=is TRGLANGS=en bilingual
${MAKE} SRCLANGS=is TRGLANGS="da no nn nb sv" bilingual
${MAKE} SRCLANGS=is TRGLANGS=fi bilingual
enru-yandex:
${MAKE} DATASET=opus+yandex SRCLANGS=ru TRGLANGS=en EXTRA_TRAINSET=yandex data
${MAKE} DATASET=opus+yandex SRCLANGS=ru TRGLANGS=en EXTRA_TRAINSET=yandex reverse-data
${MAKE} DATASET=opus+yandex SRCLANGS=en TRGLANGS=ru EXTRA_TRAINSET=yandex \
WALLTIME=72 HPC_CORES=1 HPC_MEM=8g MARIAN_WORKSPACE=12000 train.submit-multigpu
${MAKE} DATASET=opus+yandex SRCLANGS=ru TRGLANGS=en EXTRA_TRAINSET=yandex \
WALLTIME=72 HPC_CORES=1 HPC_MEM=4g train.submit-multigpu
unidirectional:
${MAKE} data
${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
bilingual:
${MAKE} data
${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
${MAKE} reverse-data
${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
bilingual-medium:
${MAKE} data
${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
MARIAN_VALID_FREQ=5000 MARIAN_WORKSPACE=10000 train.submit
${MAKE} reverse-data
${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
MARIAN_VALID_FREQ=5000 MARIAN_WORKSPACE=10000 train.submit
bilingual-small:
${MAKE} data
${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
MARIAN_WORKSPACE=5000 MARIAN_VALID_FREQ=2500 train.submit
${MAKE} reverse-data
${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
MARIAN_WORKSPACE=5000 MARIAN_VALID_FREQ=2500 train.submit
multilingual:
${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" data
${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" \
WALLTIME=72 HPC_CORES=1 HPC_MEM=4g train.submit-multigpu
multilingual-medium:
${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" data
${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" \
MARIAN_VALID_FREQ=5000 MARIAN_WORKSPACE=10000 \
WALLTIME=72 HPC_CORES=1 HPC_MEM=4g train.submit-multigpu
all2pivot:
for l in ${filter-out ${PIVOT},${LANGS}}; do \
${MAKE} SRCLANGS="$$l" TRGLANGS="${PIVOT}" data; \
${MAKE} SRCLANGS="$$l" TRGLANGS="${PIVOT}" HPC_CORES=1 HPC_MEM=4g train.submit-multigpu; \
${MAKE} SRCLANGS="$$l" TRGLANGS="${PIVOT}" reverse-data; \
${MAKE} SRCLANGS="${PIVOT}" TRGLANGS="$$l" HPC_CORES=1 HPC_MEM=4g train.submit-multigpu; \
done
bilingual-dynamic:
if [ ! -e "${WORKHOME}/${LANGSTR}/train.submit" ]; then \
${MAKE} data; \
if [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 10000000 ]; then \
echo "${LANGSTR} bigger than 10 million"; \
${MAKE} HPC_CORES=1 HPC_MEM=8g train.submit-multigpu; \
if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
${MAKE} reverse-data-spm; \
${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' HPC_CORES=1 HPC_MEM=8g train.submit-multigpu; \
fi; \
elif [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 1000000 ]; then \
echo "${LANGSTR} bigger than 1 million"; \
${MAKE} \
MARIAN_VALID_FREQ=2500 \
HPC_CORES=1 HPC_MEM=4g train.submit; \
if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
${MAKE} reverse-data-spm; \
${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' \
MARIAN_VALID_FREQ=2500 \
HPC_CORES=1 HPC_MEM=4g train.submit; \
fi; \
elif [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 100000 ]; then \
echo "${LANGSTR} bigger than 100k"; \
${MAKE} \
MARIAN_VALID_FREQ=1000 \
MARIAN_WORKSPACE=5000 \
MARIAN_VALID_MINI_BATCH=8 \
MARIAN_EARLY_STOPPING=5 \
HPC_CORES=1 HPC_MEM=4g train.submit; \
if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
${MAKE} reverse-data-spm; \
${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' \
MARIAN_VALID_FREQ=1000 \
MARIAN_WORKSPACE=5000 \
MARIAN_VALID_MINI_BATCH=8 \
MARIAN_EARLY_STOPPING=5 \
HPC_CORES=1 HPC_MEM=4g train.submit; \
fi; \
elif [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 10000 ]; then \
echo "${LANGSTR} bigger than 10k"; \
${MAKE} \
MARIAN_WORKSPACE=3500 \
MARIAN_VALID_MINI_BATCH=4 \
MARIAN_DROPOUT=0.5 \
MARIAN_VALID_FREQ=1000 \
MARIAN_EARLY_STOPPING=5 \
HPC_CORES=1 HPC_MEM=4g train.submit; \
if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
${MAKE} reverse-data-spm; \
${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' \
MARIAN_WORKSPACE=3500 \
MARIAN_VALID_MINI_BATCH=4 \
MARIAN_DROPOUT=0.5 \
MARIAN_VALID_FREQ=1000 \
MARIAN_EARLY_STOPPING=5 \
HPC_CORES=1 HPC_MEM=4g train.submit; \
fi; \
else \
echo "${LANGSTR} too small"; \
fi; \
fi
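The `bilingual-dynamic` target above boils down to a size-based switch: count the clean training sentences, then pick increasingly conservative Marian settings for smaller corpora. A minimal sketch of those tiers (the helper name is hypothetical; the real target invokes `${MAKE}` with the corresponding settings instead of echoing them):

```shell
#!/bin/sh
# Sketch of the corpus-size tiers used by bilingual-dynamic.
# Input: number of parallel training sentences; output: the chosen setup.
select_tier() {
  lines=$1
  if   [ "$lines" -gt 10000000 ]; then echo "multigpu HPC_MEM=8g"
  elif [ "$lines" -gt 1000000 ];  then echo "single-gpu MARIAN_VALID_FREQ=2500"
  elif [ "$lines" -gt 100000 ];   then echo "single-gpu MARIAN_WORKSPACE=5000 MARIAN_VALID_FREQ=1000"
  elif [ "$lines" -gt 10000 ];    then echo "single-gpu MARIAN_WORKSPACE=3500 MARIAN_DROPOUT=0.5"
  else echo "too-small"
  fi
}

select_tier 2500000   # falls into the 1M-10M tier
```

A pair with 2.5 million sentences would thus train on a single GPU with more frequent validation, while anything at or below 10k sentences is skipped as too small.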
# iso639 = aa ab ae af ak am an ar as av ay az ba be bg bh bi bm bn bo br bs ca ce ch cn co cr cs cu cv cy da de dv dz ee el en eo es et eu fa ff fi fj fo fr fy ga gd gl gn gr gu gv ha hb he hi ho hr ht hu hy hz ia id ie ig ik io is it iu ja jp jv ka kg ki kj kk kl km kn ko kr ks ku kv kw ky la lb lg li ln lo lt lu lv me mg mh mi mk ml mn mo mr ms mt my na nb nd ne ng nl nn no nr nv ny oc oj om or os pa pi pl po ps pt qu rm rn ro ru rw ry sa sc sd se sg sh si sk sl sm sn so sq sr ss st su sv sw ta tc te tg th ti tk tl tn to tr ts tt tw ty ua ug uk ur uz ve vi vo wa wo xh yi yo za zh zu
# NO_MEMAD = ${filter-out fi sv de fr nl,${iso639}}
#"de_AT de_CH de_DE de"
#"en_AU en_CA en_GB en_NZ en_US en_ZA en"
#"it_IT it"
#"es_AR es_CL es_CO es_CR es_DO es_EC es_ES es_GT es_HN es_MX es_NI es_PA es_PE es_PR es_SV es_UY es_VE es"
#"eu_ES eu"
#"hi_IN hi"
#"fr_BE fr_CA fr_FR fr"
#"fa_AF fa_IR fa"
#"ar_SY ar_TN ar"
#"bn_IN bn"
#da_DK
#bg_BG
#nb_NO
#nl_BE nl_NL
#tr_TR
### ze_en - English subtitles in Chinese movies
OPUSLANGS = fi sv fr es de ar he "cmn cn yue ze_zh zh_cn zh_CN zh_HK zh_tw zh_TW zh_yue zhs zht zh" "pt_br pt_BR pt_PT pt" aa ab ace ach acm acu ada ady aeb aed ae afb afh af agr aha aii ain ajg aka ake akl ak aln alt alz amh ami amu am ang an aoc aoz apc ara arc arh arn arq ary arz ase asf ast as ati atj avk av awa aym ay azb "az_IR az" bal bam ban bar bas ba bbc bbj bci bcl bem ber "be_tarask be" bfi bg bho bhw bh bin bi bjn bm bn bnt bo bpy brx br bsn bs btg bts btx bua bug bum bvl bvy bxr byn byv bzj bzs cab cac cak cat cay ca "cbk_zam cbk" cce cdo ceb ce chf chj chk cho chq chr chw chy ch cjk cjp cjy ckb ckt cku cmo cnh cni cop co "crh_latn crh" crp crs cr csb cse csf csg csl csn csr cs cto ctu cuk cu cv cycl cyo cy daf da dga dhv dik din diq dje djk dng dop dsb dtp dty dua dv dws dyu dz ecs ee efi egl el eml enm eo esn et eu ewo ext fan fat fa fcs ff fil fj fkv fon foo fo frm frp frr fse fsl fuc ful fur fuv fy gaa gag gan ga gbi gbm gcf gcr gd gil glk gl gn gom gor gos got grc gr gsg gsm gss gsw guc gug gum gur guw gu gv gxx gym hai hak hau haw ha haz hb hch hds hif hi hil him hmn hne hnj hoc ho hrx hr hsb hsh hsn ht hup hus hu hyw hy hz ia iba ibg ibo id ie ig ike ik ilo inh inl ins io iro ise ish iso is it iu izh jak jam jap ja jbo jdt jiv jmx jp jsl jv kaa kab kac kam kar kau ka kbd kbh kbp kea kek kg kha kik kin ki kjh kj kk kl kmb kmr km kn koi kok kon koo ko kpv kqn krc kri krl kr ksh kss ksw ks kum ku kvk kv kwn kwy kw kxi ky kzj lad lam la lbe lb ldn lez lfn lg lij lin liv li lkt lld lmo ln lou lo loz lrc lsp ltg lt lua lue lun luo lus luy lu lv lzh lzz mad mai mam map_bms mau max maz mco mcp mdf men me mfe mfs mgm mgr mg mhr mh mic min miq mi mk mlg ml mnc mni mnw mn moh mos mo mrj mrq mr "ms_MY ms" mt mus mvv mwl mww mxv myv my mzn mzy nah nan nap na nba "nb_NO nb nn_NO nn nog no_nb no" nch nci ncj ncs ncx ndc "nds_nl nds" nd new ne ngl ngt ngu ng nhg nhk nhn nia nij niu nlv nl nnh non nov npi nqo nrm nr nso nst nv nya nyk nyn nyu ny nzi oar oc ojb \
oj oke olo om orm orv or osx os ota ote otk pag pam pan pap pau pa pbb pcd pck pcm pdc pdt pes pfl pid pih pis pi plt pl pms pmy pnb pnt pon pot po ppk ppl prg prl prs pso psp psr ps pys quc que qug qus quw quy qu quz qvi qvz qya rap rar rcf rif rmn rms rmy rm rnd rn rom ro rsl rue run rup ru rw ry sah sat sa sbs scn sco sc sd seh se sfs sfw sgn sgs sg shi shn shs shy sh sid simple si sjn sk sl sma sml sm sna sn som son sop sot so sqk sq "sr_ME sr srp" srm srn ssp ss stq st sux su svk swa swc swg swh sw sxn syr szl "ta_LK ta" tcf tcy tc tdt tdx tet te "tg_TJ tg" thv th tig tir tiv ti tkl tk tlh tll "tl_PH tl" tly tmh tmp tmw tn tob tog toh toi toj toki top to tpi tpw trv tr tsc tss ts tsz ttj tt tum tvl tw tyv ty tzh tzl tzo udm ug uk umb urh "ur_PK ur" usp uz vec vep ve "vi_VN vi" vls vmw vo vro vsl wae wal war wa wba wes wls wlv wol wo wuu xal xho xh xmf xpe yao yap yaq ybb yi yor yo yua zab zai zam za zdj zea zib zlm zne zpa zpg zsl zsm "zul zu" zza
allopus2pivot:
for l in ${filter-out ${PIVOT},${OPUSLANGS}}; do \
${MAKE} WALLTIME=72 SRCLANGS="$$l" bilingual-dynamic; \
done
## this looks dangerous ....
allopus:
for s in ${OPUSLANGS}; do \
for t in ${OPUSLANGS}; do \
if [ ! -e "${WORKHOME}/$$s-$$t/train.submit" ]; then \
echo "${MAKE} WALLTIME=72 SRCLANGS=\"$$s\" TRGLANGS=\"$$t\" bilingual-dynamic"; \
${MAKE} WALLTIME=72 SRCLANGS="$$s" TRGLANGS="$$t" bilingual-dynamic; \
fi; \
done; \
done
all2en:
${MAKE} PIVOT=en allopus2pivot
enit:
${MAKE} SRCLANGS=en TRGLANGS=it traindata-spm
${MAKE} SRCLANGS=en TRGLANGS=it devdata-spm
${MAKE} SRCLANGS=en TRGLANGS=it wordalign-spm
${MAKE} SRCLANGS=en TRGLANGS=it WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
memad-fiensv:
${MAKE} SRCLANGS=sv TRGLANGS=fi traindata-spm
${MAKE} SRCLANGS=sv TRGLANGS=fi devdata-spm
${MAKE} SRCLANGS=sv TRGLANGS=fi wordalign-spm
${MAKE} SRCLANGS=sv TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
${MAKE} SRCLANGS=sv TRGLANGS=fi reverse-data-spm
${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
${MAKE} SRCLANGS=en TRGLANGS=fi traindata-spm
${MAKE} SRCLANGS=en TRGLANGS=fi devdata-spm
${MAKE} SRCLANGS=en TRGLANGS=fi wordalign-spm
${MAKE} SRCLANGS=en TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
${MAKE} SRCLANGS=en TRGLANGS=fi reverse-data-spm
${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
memad250-fiensv:
${MAKE} CONTEXT_SIZE=250 memad-fiensv_doc
memad-fiensv_doc:
${MAKE} SRCLANGS=sv TRGLANGS=fi traindata-doc
${MAKE} SRCLANGS=sv TRGLANGS=fi devdata-doc
${MAKE} SRCLANGS=sv TRGLANGS=fi WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
${MAKE} SRCLANGS=sv TRGLANGS=fi reverse-data-doc
${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
${MAKE} SRCLANGS=en TRGLANGS=fi traindata-doc
${MAKE} SRCLANGS=en TRGLANGS=fi devdata-doc
${MAKE} SRCLANGS=en TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-doc.submit-multigpu
${MAKE} SRCLANGS=en TRGLANGS=fi reverse-data-doc
${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-doc.submit-multigpu
memad-fiensv_more:
${MAKE} SRCLANGS=sv TRGLANGS=fi traindata-doc
${MAKE} SRCLANGS=sv TRGLANGS=fi devdata-doc
${MAKE} SRCLANGS=sv TRGLANGS=fi WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
${MAKE} SRCLANGS=sv TRGLANGS=fi reverse-data-doc
${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
${MAKE} CONTEXT_SIZE=500 memad-fiensv_doc
memad:
for s in fi en sv de fr nl; do \
for t in en fi sv de fr nl; do \
if [ "$$s" != "$$t" ]; then \
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR}.log; then \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata devdata wordalign; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g train.submit-multigpu; \
fi; \
fi; \
done; \
done
doclevel:
${MAKE} ost-datasets
${MAKE} traindata-doc-ost
${MAKE} devdata-doc-ost
${MAKE} wordalign-doc-ost
${MAKE} CONTEXT_SIZE=${CONTEXT_SIZE} MODELTYPE=${MODELTYPE} \
HPC_CORES=1 WALLTIME=72 HPC_MEM=4g train-doc-ost.submit
fiensv_bpe:
${MAKE} SRCLANGS=fi TRGLANGS=sv traindata-bpe
${MAKE} SRCLANGS=fi TRGLANGS=sv devdata-bpe
${MAKE} SRCLANGS=fi TRGLANGS=sv wordalign-bpe
${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-bpe.submit-multigpu
${MAKE} SRCLANGS=fi TRGLANGS=en traindata-bpe
${MAKE} SRCLANGS=fi TRGLANGS=en devdata-bpe
${MAKE} SRCLANGS=fi TRGLANGS=en wordalign-bpe
${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-bpe.submit-multigpu
fiensv_spm:
${MAKE} SRCLANGS=fi TRGLANGS=sv traindata-spm
${MAKE} SRCLANGS=fi TRGLANGS=sv devdata-spm
${MAKE} SRCLANGS=fi TRGLANGS=sv wordalign-spm
${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
${MAKE} SRCLANGS=fi TRGLANGS=en traindata-spm
${MAKE} SRCLANGS=fi TRGLANGS=en devdata-spm
${MAKE} SRCLANGS=fi TRGLANGS=en wordalign-spm
${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
fifr_spm:
${MAKE} SRCLANGS=fr TRGLANGS=fi traindata-spm
${MAKE} SRCLANGS=fr TRGLANGS=fi devdata-spm
${MAKE} SRCLANGS=fr TRGLANGS=fi wordalign-spm
${MAKE} SRCLANGS=fr TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
${MAKE} SRCLANGS=fr TRGLANGS=fi reverse-data-spm
${MAKE} SRCLANGS=fi TRGLANGS=fr WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
fifr_doc:
${MAKE} SRCLANGS=fr TRGLANGS=fi traindata-doc
${MAKE} SRCLANGS=fr TRGLANGS=fi devdata-doc
${MAKE} SRCLANGS=fr TRGLANGS=fi WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
${MAKE} SRCLANGS=fr TRGLANGS=fi reverse-data-doc
${MAKE} SRCLANGS=fi TRGLANGS=fr WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
fide_spm:
${MAKE} SRCLANGS=de TRGLANGS=fi traindata-spm
${MAKE} SRCLANGS=de TRGLANGS=fi devdata-spm
${MAKE} SRCLANGS=de TRGLANGS=fi wordalign-spm
${MAKE} SRCLANGS=de TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
${MAKE} SRCLANGS=de TRGLANGS=fi reverse-data-spm
${MAKE} SRCLANGS=fi TRGLANGS=de WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
memad_spm:
for s in fi en sv de fr nl; do \
for t in en fi sv de fr nl; do \
if [ "$$s" != "$$t" ]; then \
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR}.log; then \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata-spm; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t devdata-spm; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t wordalign-spm; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g train-spm.submit-multigpu; \
fi; \
fi; \
done; \
done
memad_doc:
for s in fi en sv; do \
for t in en fi sv; do \
if [ "$$s" != "$$t" ]; then \
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR}.log; then \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata-doc; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t devdata-doc; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g MODELTYPE=transformer train-doc.submit-multigpu; \
fi; \
fi; \
done; \
done
memad_docalign:
for s in fi en sv; do \
for t in en fi sv; do \
if [ "$$s" != "$$t" ]; then \
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR}.log; then \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata-doc; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t devdata-doc; \
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g train-doc.submit-multigpu; \
fi; \
fi; \
done; \
done
enfisv:
${MAKE} SRCLANGS="en fi sv" TRGLANGS="en fi sv" traindata devdata wordalign
${MAKE} SRCLANGS="en fi sv" TRGLANGS="en fi sv" HPC_MEM=4g WALLTIME=72 HPC_CORES=1 train.submit-multigpu
en-fiet:
${MAKE} SRCLANGS="en" TRGLANGS="et fi" traindata devdata
${MAKE} SRCLANGS="en" TRGLANGS="et fi" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
${MAKE} TRGLANGS="en" SRCLANGS="et fi" traindata devdata
${MAKE} TRGLANGS="en" SRCLANGS="et fi" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
memad-multi:
for s in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
${MAKE} SRCLANGS="$$s" TRGLANGS="$$s" traindata devdata; \
${MAKE} SRCLANGS="$$s" TRGLANGS="$$s" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
done
for s in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
for t in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
if [ "$$s" != "$$t" ]; then \
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" traindata devdata; \
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
fi; \
done; \
done
memad-multi2:
for s in "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
for t in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
if [ "$$s" != "$$t" ]; then \
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" traindata devdata; \
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
fi; \
done; \
done
memad-multi3:
for s in "${SCANDINAVIAN}" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
${MAKE} SRCLANGS="$$s" TRGLANGS="en" traindata devdata; \
${MAKE} SRCLANGS="$$s" TRGLANGS="en" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
${MAKE} SRCLANGS="en" TRGLANGS="$$s" traindata devdata; \
${MAKE} SRCLANGS="en" TRGLANGS="$$s" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
done
${MAKE} SRCLANGS="en" TRGLANGS="fr" traindata devdata
${MAKE} SRCLANGS="en" TRGLANGS="fr" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
${MAKE} SRCLANGS="fr" TRGLANGS="en" traindata devdata
${MAKE} SRCLANGS="fr" TRGLANGS="en" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
memad-fi:
for l in en sv de fr; do \
${MAKE} SRCLANGS=$$l TRGLANGS=fi traindata devdata; \
${MAKE} SRCLANGS=$$l TRGLANGS=fi HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
${MAKE} TRGLANGS=$$l SRCLANGS=fi traindata devdata; \
${MAKE} TRGLANGS=$$l SRCLANGS=fi HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
done
nordic:
${MAKE} SRCLANGS="${SCANDINAVIAN}" TRGLANGS="${FINNO_UGRIC}" traindata
${MAKE} SRCLANGS="${SCANDINAVIAN}" TRGLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
${MAKE} TRGLANGS="${SCANDINAVIAN}" SRCLANGS="${FINNO_UGRIC}" traindata
${MAKE} TRGLANGS="${SCANDINAVIAN}" SRCLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
romance:
${MAKE} SRCLANGS="${ROMANCE}" TRGLANGS="${FINNO_UGRIC}" traindata
${MAKE} SRCLANGS="${ROMANCE}" TRGLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
${MAKE} TRGLANGS="${ROMANCE}" SRCLANGS="${FINNO_UGRIC}" traindata
${MAKE} TRGLANGS="${ROMANCE}" SRCLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
westgermanic:
${MAKE} SRCLANGS="${WESTGERMANIC}" TRGLANGS="${FINNO_UGRIC}" traindata
${MAKE} SRCLANGS="${WESTGERMANIC}" TRGLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
${MAKE} TRGLANGS="${WESTGERMANIC}" SRCLANGS="${FINNO_UGRIC}" traindata
${MAKE} TRGLANGS="${WESTGERMANIC}" SRCLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
germanic-romance:
${MAKE} SRCLANGS="${ROMANCE}" \
TRGLANGS="${GERMANIC}" traindata
${MAKE} HPC_MEM=4g HPC_CORES=1 SRCLANGS="${ROMANCE}" \
TRGLANGS="${GERMANIC}" train.submit-multigpu
${MAKE} TRGLANGS="${ROMANCE}" \
SRCLANGS="${GERMANIC}" traindata devdata
${MAKE} HPC_MEM=4g HPC_CORES=1 TRGLANGS="${ROMANCE}" \
SRCLANGS="${GERMANIC}" train.submit-multigpu

README.md (new file, 54 lines)
# Train Opus-MT models
This folder includes make targets for training NMT models using MarianNMT and OPUS data. More details are given in the [Makefile](Makefile), but the documentation still needs to be improved. Note that the targets require a specific environment; right now they only work well on the CSC HPC cluster in Finland.
## Structure
Essential files for making new models:
* `Makefile`: top-level makefile
* `Makefile.env`: system-specific environment (now based on CSC machines)
* `Makefile.config`: essential model configuration
* `Makefile.data`: data pre-processing tasks
* `Makefile.doclevel`: experimental document-level models
* `Makefile.tasks`: tasks for training specific models and other things (this frequently changes)
* `Makefile.dist`: make packages for distributing models (CSC ObjectStorage based)
* `Makefile.slurm`: submit jobs with SLURM
Run this if you want to train a model, for example for translating English to French:
```
make SRCLANGS=en TRGLANGS=fr train
```
To evaluate the model with the automatically generated test data (from the Tatoeba corpus by default), run:
```
make SRCLANGS=en TRGLANGS=fr eval
```
For multilingual models (more than one language on either side), run, for example:
```
make SRCLANGS="de en" TRGLANGS="fr es pt" train
make SRCLANGS="de en" TRGLANGS="fr es pt" eval
```
Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed things up, you can process data sets in parallel using make's jobs flag, for example with 8 threads:
```
make -j 8 SRCLANGS=en TRGLANGS=fr data
```
## Upload to Object Storage
```
swift upload OPUS-MT --changed --skip-identical name-of-file
swift post OPUS-MT --read-acl ".r:*"
```

TODO.md (new file, 8 lines)
# Things to do
* add backtranslations to training data
  * can use monolingual data from tokenized wikipedia dumps: https://sites.google.com/site/rmyeid/projects/polyglot
  * https://dumps.wikimedia.org/backup-index.html
  * better in JSON: https://dumps.wikimedia.org/other/cirrussearch/current/

large-context.pl (new executable file, 115 lines)
#!/usr/bin/env perl
use strict;
use vars qw($opt_l);
use Getopt::Std;
getopts('l:');
my $max = $opt_l || 100;
my $srcfile = shift(@ARGV);
my $trgfile = shift(@ARGV);
my $algfile = shift(@ARGV);
if ($srcfile=~/\.gz$/){
open S,"gzip -cd <$srcfile |" or die "cannot open $srcfile";
}
else{ open S,"<$srcfile" or die "cannot open $srcfile"; }
if ($trgfile=~/\.gz$/){
open T,"gzip -cd <$trgfile |" or die "cannot open $trgfile";
}
else{ open T,"<$trgfile" or die "cannot open $trgfile"; }
if ($algfile=~/\.gz$/){
open A,"gzip -cd <$algfile |" or die "cannot open $algfile";
}
else{ open A,"<$algfile" or die "cannot open $algfile"; }
binmode(S,":utf8");
binmode(T,":utf8");
binmode(STDOUT,":utf8");
my $srcdoc = '<BEG> ';
my $trgdoc = '<BEG> ';
my $algdoc = '0-0';
my $srccount = 0;
my $trgcount = 0;
my $segcount = 0;
while (<S>){
chomp;
my $trg = <T>;
my $alg = <A>;
chomp($trg);
chomp($alg);
my @srctok = split(/\s+/);
my @trgtok = split(/\s+/,$trg);
if ( ($srccount+@srctok > $max) || ($trgcount+@trgtok > $max) ){
$srcdoc .= '<BRK>';
$trgdoc .= '<BRK>';
$algdoc .= ' ';
$algdoc .= $srccount+$segcount+1;
$algdoc .= '-';
$algdoc .= $trgcount+$segcount+1;
print $srcdoc,"\t",$trgdoc,"\t",$algdoc,"\n";
$srcdoc = '<CNT> ';
$trgdoc = '<CNT> ';
$algdoc = '0-0';
$srccount = 0;
$trgcount = 0;
$segcount = 0;
}
if ( @srctok == 0 && @trgtok == 0 ){
$srcdoc .= '<END>';
$trgdoc .= '<END>';
$algdoc .= ' ';
$algdoc .= $srccount+$segcount+1;
$algdoc .= '-';
$algdoc .= $trgcount+$segcount+1;
print $srcdoc,"\t",$trgdoc,"\t",$algdoc,"\n";
$srcdoc = '<BEG> ';
$trgdoc = '<BEG> ';
$algdoc = '0-0';
$srccount = 0;
$trgcount = 0;
$segcount = 0;
next;
}
$srcdoc .= join(' ',@srctok);
$trgdoc .= join(' ',@trgtok);
$algdoc .= adjust_alignment($alg,$srccount,$trgcount,$segcount);
$srcdoc .= ' <SEP> ';
$trgdoc .= ' <SEP> ';
$srccount += @srctok;
$trgcount += @trgtok;
$segcount++;
$algdoc .= ' ';
$algdoc .= $srccount+$segcount;
$algdoc .= '-';
$algdoc .= $trgcount+$segcount;
}
if ($srcdoc || $trgdoc){
print $srcdoc,"\t",$trgdoc,"\t",$algdoc,"\n";
}
sub adjust_alignment{
my ($alg,$srccount,$trgcount,$segcount) = @_;
my @links = split(/\s+/,$alg);
my @newLinks = ();
foreach my $l (@links){
my ($s,$t) = split(/\-/,$l);
$s += $srccount+$segcount+1;
$t += $trgcount+$segcount+1;
push(@newLinks,$s.'-'.$t);
}
return ' '.join(' ',@newLinks) if (@newLinks);
return '';
}
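The offset arithmetic in `adjust_alignment` can be checked in isolation: a word-alignment link `s-t` from the current sentence is shifted by the number of tokens already written to the document plus one marker token (`<BEG>` or `<SEP>`) per preceding segment. A small sketch with a hypothetical helper name:

```shell
#!/bin/sh
# Shift one alignment link the same way adjust_alignment does:
#   new_s = s + srccount + segcount + 1
#   new_t = t + trgcount + segcount + 1
shift_link() {
  s=$1; t=$2; srccount=$3; trgcount=$4; segcount=$5
  echo "$((s + srccount + segcount + 1))-$((t + trgcount + segcount + 1))"
}

shift_link 0 0 0 0 0   # first link of a fresh document; <BEG> occupies index 0
shift_link 2 3 7 8 1   # a link from the second sentence, after a 7/8-token first segment
```

The `+1` accounts for the initial `<BEG>` token, and `segcount` for the `<SEP>` tokens inserted between segments.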

models/Makefile (new file, 29 lines)
MODELS = ${shell find . -type f -name '*.zip'}
## fix decoder.yml to match the typical setup
## and the names of the model and vocab in the zip file
fix-config:
for m in ${MODELS}; do \
f=`unzip -l $$m | grep -oi '[^ ]*npz'`; \
v=`unzip -l $$m | grep -oi '[^ ]*vocab.yml'`; \
echo 'models:' > decoder.yml; \
echo " - $$f" >> decoder.yml; \
echo 'vocabs:' >> decoder.yml; \
echo " - $$v" >> decoder.yml; \
echo " - $$v" >> decoder.yml; \
echo 'beam-size: 6' >> decoder.yml; \
echo 'normalize: 1' >> decoder.yml; \
echo 'word-penalty: 0' >> decoder.yml; \
echo 'mini-batch: 1' >> decoder.yml; \
echo 'maxi-batch: 1' >> decoder.yml; \
echo 'maxi-batch-sort: src' >> decoder.yml; \
echo 'relative-paths: true' >> decoder.yml; \
zip $$m decoder.yml; \
done
rm -f decoder.yml
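For a model archive containing, say, `opus.npz` and `opus.vocab.yml` (hypothetical file names; the loop above extracts the actual names from each zip), the generated `decoder.yml` would look like this:

```yaml
models:
  - opus.npz
vocabs:
  - opus.vocab.yml
  - opus.vocab.yml
beam-size: 6
normalize: 1
word-penalty: 0
mini-batch: 1
maxi-batch: 1
maxi-batch-sort: src
relative-paths: true
```

The vocab is listed twice because Marian expects one vocabulary per translation side, and these models share a single joint vocabulary.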

models/af-en/README.md (new file, 30 lines)
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.af.en | 55.6 | 0.664 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.af.en | 60.8 | 0.736 |

models/af-fi/README.md (new file, 15 lines)
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.af.fi | 32.3 | 0.576 |

models/af-fr/README.md (new file, 15 lines)
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.af.fr | 35.3 | 0.543 |

models/af-sv/README.md (new file, 15 lines)
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.af.sv | 40.4 | 0.599 |

models/am-en/README.md (new file, 15 lines)
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/am-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.am.en | 23.5 | 0.492 |

models/am-sv/README.md (new file, 15 lines)
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/am-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.am.sv | 21.0 | 0.377 |

models/ar-en/README.md (new file, 30 lines)
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ar.en | 46.7 | 0.620 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ar.en | 49.4 | 0.661 |

models/ar-fr/README.md (new file, 15 lines)
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ar-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ar.fr | 43.2 | 0.600 |

models/as-en/README.md (new file, 15 lines)
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/as-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/as-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/as-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.as.en | 89.3 | 0.901 |

models/az-en/README.md (new file, 15 lines)
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/az-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/az-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/az-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.az.en | 30.4 | 0.564 |

models/bcl-fi/README.md (new file, 15 lines)
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bcl.fi | 33.3 | 0.573 |

models/bcl-fr/README.md (new file, 15 lines)
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bcl.fr | 35.0 | 0.527 |

models/bcl-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bcl-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bcl.sv | 38.0 | 0.565 |

models/bem-en/README.md
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.en | 33.4 | 0.491 |

models/bem-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.fi | 22.8 | 0.439 |

models/bem-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.fr | 25.0 | 0.417 |

models/bem-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.sv | 25.6 | 0.434 |

models/ber-en/README.md
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ber-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ber.en | 37.3 | 0.566 |

models/ber-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ber-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ber.fr | 60.2 | 0.754 |

models/bg-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bg.en | 61.6 | 0.718 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bg.en | 59.4 | 0.727 |

models/bg-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bg.fi | 23.7 | 0.505 |

models/bg-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| GlobalVoices.bg.fr | 20.9 | 0.480 |

models/bg-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bg.sv | 29.1 | 0.494 |

models/bn-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bn.en | 53.3 | 0.639 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bn.en | 49.8 | 0.644 |

models/br-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/br-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/br-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/br-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.br.en | 86.3 | 0.917 |

models/bs-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/bs-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bs-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bs-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bs.en | 33.3 | 0.536 |

models/bzs-en/README.md
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.en | 44.5 | 0.605 |

models/bzs-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.fi | 24.7 | 0.464 |

models/bzs-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.fr | 30.0 | 0.479 |

models/bzs-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.sv | 30.7 | 0.489 |

models/ca+es+fr+ga+it+la+oc+pt_br+pt-ca+es+fr+ga+it+la+oc+pt_br+pt/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form `>>id<<` (id = a valid target-language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/ca+es+fr+ga+it+la+oc+pt_br+pt-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca+es+fr+ga+it+la+oc+pt_br+pt-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca+es+fr+ga+it+la+oc+pt_br+pt-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.es.fr | 29.6 | 0.561 |
| newssyscomb2009.fr.es | 30.3 | 0.561 |
| news-test2008.es.fr | 27.9 | 0.538 |
| news-test2008.fr.es | 28.6 | 0.538 |
| newstest2009.es.fr | 26.3 | 0.537 |
| newstest2009.fr.es | 27.1 | 0.537 |
| newstest2010.es.fr | 30.2 | 0.563 |
| newstest2010.fr.es | 30.9 | 0.563 |
| newstest2011.es.fr | 29.3 | 0.552 |
| newstest2011.fr.es | 30.1 | 0.552 |
| newstest2012.es.fr | 29.5 | 0.553 |
| newstest2012.fr.es | 30.2 | 0.553 |
| newstest2013.es.fr | 27.6 | 0.536 |
| newstest2013.fr.es | 28.4 | 0.536 |
| Tatoeba.ca.pt | 50.7 | 0.659 |
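As noted above, multilingual models expect a sentence-initial `>>id<<` token selecting the target language. A trivial illustrative helper (the function name is ours, not part of any released package):

```python
def add_target_token(sentence: str, target_lang: str) -> str:
    # Prepend the sentence-initial token that multilingual OPUS-MT
    # models use to select the target language, e.g. ">>pt<<".
    return f">>{target_lang}<< {sentence}"

print(add_target_token("Bon dia!", "pt"))  # >>pt<< Bon dia!
```

The token must be one of the target-language IDs the model was trained on; otherwise the output language is undefined.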

models/ca-en/README.md
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ca.en | 51.4 | 0.678 |

models/ca-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ca-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ca.fr | 50.4 | 0.672 |

models/ceb-en/README.md
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.en | 52.6 | 0.670 |

models/ceb-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.fi | 27.4 | 0.525 |

models/ceb-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.fr | 30.0 | 0.491 |

models/ceb-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.sv | 35.5 | 0.552 |

models/chk-en/README.md
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/chk-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.chk.en | 31.2 | 0.465 |

models/chk-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/chk-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.chk.fr | 22.4 | 0.387 |

models/chk-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/chk-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.chk.sv | 23.6 | 0.406 |

models/crs-en/README.md
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.crs.en | 42.9 | 0.589 |

models/crs-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.crs.fi | 25.6 | 0.479 |

models/crs-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.crs.fr | 29.4 | 0.475 |

models/crs-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.crs.sv | 29.3 | 0.480 |

models/cs-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newstest2014-csen.cs.en | 31.5 | 0.589 |
| newstest2015-encs.cs.en | 27.5 | 0.540 |
| newstest2016-encs.cs.en | 28.5 | 0.561 |
| newstest2017-encs.cs.en | 26.6 | 0.540 |
| newstest2018-encs.cs.en | 27.1 | 0.540 |
| Tatoeba.cs.en | 62.5 | 0.743 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newstest2014-csen.cs.en | 34.1 | 0.612 |
| newstest2015-encs.cs.en | 30.4 | 0.565 |
| newstest2016-encs.cs.en | 31.8 | 0.584 |
| newstest2017-encs.cs.en | 28.7 | 0.556 |
| newstest2018-encs.cs.en | 30.3 | 0.566 |
| Tatoeba.cs.en | 58.0 | 0.721 |

models/cs-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.cs.fi | 25.5 | 0.523 |

models/cs-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| GlobalVoices.cs.fr | 21.0 | 0.488 |

models/cs-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.cs.sv | 30.6 | 0.527 |

models/cy-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.cy.en | 41.8 | 0.597 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.cy.en | 33.0 | 0.525 |

models/da+fo+is+no+nb+nn+sv-ca+es+fr+ga+it+la+oc+pt_br+pt/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form `>>id<<` (id = a valid target-language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.pt | 52.1 | 0.684 |

models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form `>>id<<` (id = a valid target-language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.sv | 70.7 | 0.824 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* a sentence-initial language token is required in the form `>>id<<` (id = a valid target-language ID)
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.sv | 69.2 | 0.811 |

models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form `>>id<<` (id = a valid target-language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.nl | 51.6 | 0.690 |

models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form `>>id<<` (id = a valid target-language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fy | 50.3 | 0.687 |

models/da+fo+is+no+nb+nn+sv-en+fr/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form `>>id<<` (id = a valid target-language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en+fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en+fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en+fr/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fr | 61.3 | 0.736 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.en | 61.3 | 0.743 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+fi/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| fiskmo_testset.sv.fi | 22.4 | 0.590 |
| Tatoeba.da.fi | 40.4 | 0.637 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+hu+fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+hu+fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+hu+fi/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| fiskmo_testset.sv.fi | 22.9 | 0.583 |
| Tatoeba.da.fi | 39.8 | 0.632 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.et | 37.8 | 0.592 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fi/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| fiskmo_testset.sv.fi | 25.7 | 0.605 |
| Tatoeba.da.fi | 41.7 | 0.643 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fr/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fr | 63.4 | 0.732 |

models/da-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.en | 65.1 | 0.774 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.en | 63.6 | 0.769 |
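Every release in these READMEs follows the same URL pattern: base URL, language-pair directory, release name, and a suffix for the package (`zip`), test translations (`test.txt`), or scores (`eval.txt`). A small helper built only from that visible pattern:

```python
BASE = "https://object.pouta.csc.fi/OPUS-MT-models"

def release_url(pair: str, release: str, suffix: str = "zip") -> str:
    """Build the download URL for a model package or its evaluation
    files, following the link pattern used throughout these READMEs."""
    return f"{BASE}/{pair}/{release}.{suffix}"

print(release_url("da-en", "opus-2019-12-18"))
print(release_url("da-en", "opus-2019-12-18", "eval.txt"))
```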

models/da-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fi | 39.0 | 0.629 |

models/da-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fr | 62.2 | 0.751 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-de+af+fy+nl/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-de+af+fy+nl/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-de+af+fy+nl/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.nl | 51.2 | 0.681 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 26.1 | 0.529 |
| news-test2008.de.en | 24.4 | 0.515 |
| newstest2009.de.en | 23.9 | 0.514 |
| newstest2010.de.en | 26.9 | 0.547 |
| newstest2011.de.en | 25.0 | 0.521 |
| newstest2012.de.en | 26.6 | 0.531 |
| newstest2013.de.en | 28.8 | 0.546 |
| newstest2014-deen.de.en | 28.3 | 0.547 |
| newstest2015-ende.de.en | 29.0 | 0.548 |
| newstest2016-ende.de.en | 34.0 | 0.595 |
| newstest2017-ende.de.en | 29.4 | 0.559 |
| newstest2018-ende.de.en | 36.5 | 0.605 |
| newstest2019-deen.de.en | 33.6 | 0.584 |
| Tatoeba.de.en | 42.0 | 0.631 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-et+fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-et+fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-et+fi/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fi | 38.8 | 0.612 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-fr/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| euelections_dev2019.transformer.de | 22.2 | 0.530 |
| newssyscomb2009.de.fr | 18.6 | 0.500 |
| news-test2008.de.fr | 18.4 | 0.495 |
| newstest2009.de.fr | 18.2 | 0.486 |
| newstest2010.de.fr | 19.9 | 0.514 |
| newstest2011.de.fr | 18.8 | 0.493 |
| newstest2012.de.fr | 19.5 | 0.495 |
| newstest2013.de.fr | 20.1 | 0.498 |
| newstest2019-defr.de.fr | 24.3 | 0.554 |
| Tatoeba.de.fr | 29.4 | 0.570 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-de+nl+af+fy/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-de+nl+af+fy/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-de+nl+af+fy/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fy | 51.7 | 0.691 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 25.6 | 0.525 |
| news-test2008.de.en | 24.2 | 0.515 |
| newstest2009.de.en | 23.7 | 0.513 |
| newstest2010.de.en | 26.8 | 0.548 |
| newstest2011.de.en | 24.9 | 0.522 |
| newstest2012.de.en | 26.3 | 0.529 |
| newstest2013.de.en | 28.8 | 0.546 |
| newstest2014-deen.de.en | 28.3 | 0.548 |
| newstest2015-ende.de.en | 28.9 | 0.549 |
| newstest2016-ende.de.en | 33.9 | 0.595 |
| newstest2017-ende.de.en | 29.4 | 0.558 |
| newstest2018-ende.de.en | 36.2 | 0.603 |
| newstest2019-deen.de.en | 33.8 | 0.585 |
| Tatoeba.de.en | 41.6 | 0.626 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.sv | 48.1 | 0.663 |

models/de-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 26.9 | 0.535 |
| news-test2008.de.en | 24.8 | 0.519 |
| newstest2009.de.en | 24.6 | 0.519 |
| newstest2010.de.en | 27.5 | 0.552 |
| newstest2011.de.en | 25.6 | 0.526 |
| newstest2012.de.en | 27.1 | 0.535 |
| newstest2013.de.en | 29.4 | 0.551 |
| newstest2014-deen.de.en | 29.1 | 0.553 |
| newstest2015-ende.de.en | 29.7 | 0.556 |
| newstest2016-ende.de.en | 34.8 | 0.600 |
| newstest2017-ende.de.en | 29.9 | 0.563 |
| newstest2018-ende.de.en | 37.4 | 0.611 |
| newstest2019-deen.de.en | 34.3 | 0.587 |
| Tatoeba.de.en | 54.5 | 0.677 |
# opus-2019-12-18.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-18.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 28.6 | 0.553 |
| news-test2008.de.en | 27.6 | 0.547 |
| newstest2009.de.en | 26.9 | 0.544 |
| newstest2010.de.en | 30.4 | 0.585 |
| newstest2011.de.en | 27.5 | 0.554 |
| newstest2012.de.en | 29.0 | 0.567 |
| newstest2013.de.en | 32.2 | 0.583 |
| newstest2014-deen.de.en | 33.8 | 0.596 |
| newstest2015-ende.de.en | 34.3 | 0.598 |
| newstest2016-ende.de.en | 40.1 | 0.646 |
| newstest2017-ende.de.en | 35.6 | 0.609 |
| newstest2018-ende.de.en | 43.8 | 0.667 |
| newstest2019-deen.de.en | 39.6 | 0.637 |
| Tatoeba.de.en | 55.1 | 0.704 |
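The chr-F column in these tables is a character n-gram F-score. A simplified pure-Python sketch of the idea (averaged character n-gram F-beta with beta=2, n=1..6; this is an illustration, not the exact implementation used to produce the scores above):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Count character n-grams, ignoring whitespace.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Average character n-gram F-beta score over n = 1..max_n.
    A simplified sketch of the chr-F metric, for illustration only."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        b2 = beta * beta
        scores.append((1 + b2) * prec * rec / (b2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0
```

Identical sentences score 1.0 and fully disjoint ones 0.0; real chrF implementations differ in whitespace handling and corpus-level aggregation.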

models/de-fi/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fi | 40.1 | 0.624 |
# goethe-2019-11-15.zip
* dataset: opus+goethe
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [goethe-2019-11-15.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/goethe-2019-11-15.zip)
* info: trained on OPUS and fine-tuned for 6 epochs on data from the Goethe Institute
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| goethe.de.fi | 39.26 | |
# goethe-2020-01-07.zip
* dataset: opus+goethe
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [goethe-2020-01-07.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/goethe-2020-01-07.zip)
* info: trained on OPUS and fine-tuned for 3 epochs on data from the Goethe Institute without duplicates
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| goethe.de.fi | 38.57 | |
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fi | 40.0 | 0.628 |

models/de-fr/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| euelections_dev2019.transformer.de | 22.5 | 0.531 |
| newssyscomb2009.de.fr | 18.8 | 0.500 |
| news-test2008.de.fr | 18.4 | 0.494 |
| newstest2009.de.fr | 18.4 | 0.487 |
| newstest2010.de.fr | 20.2 | 0.517 |
| newstest2011.de.fr | 18.9 | 0.494 |
| newstest2012.de.fr | 19.6 | 0.497 |
| newstest2013.de.fr | 20.4 | 0.502 |
| newstest2019-defr.de.fr | 24.5 | 0.557 |
| Tatoeba.de.fr | 54.8 | 0.666 |
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| euelections_dev2019.transformer-align.de | 32.2 | 0.590 |
| newssyscomb2009.de.fr | 26.8 | 0.553 |
| news-test2008.de.fr | 26.4 | 0.548 |
| newstest2009.de.fr | 25.6 | 0.539 |
| newstest2010.de.fr | 29.1 | 0.572 |
| newstest2011.de.fr | 26.9 | 0.551 |
| newstest2012.de.fr | 27.7 | 0.554 |
| newstest2013.de.fr | 29.5 | 0.560 |
| newstest2019-defr.de.fr | 36.6 | 0.625 |
| Tatoeba.de.fr | 49.2 | 0.664 |

models/de-nl/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.nl | 52.6 | 0.697 |

models/de-sv/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-sv/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.sv | 55.0 | 0.699 |

models/ee-en/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ee-en/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-en/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-en/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ee.en | 39.3 | 0.556 |
| Tatoeba.ee.en | 21.2 | 0.569 |

models/ee-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ee-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ee.fr | 27.1 | 0.450 |

models/ee-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ee-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ee.sv | 28.9 | 0.472 |

models/efi-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/efi-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.efi.fi | 23.6 | 0.450 |

models/efi-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/efi-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.efi.sv | 26.8 | 0.447 |

models/el-en/README.md
# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-en/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.el.en | 69.4 | 0.801 |

models/el-fi/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fi/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.el.fi | 25.3 | 0.517 |

models/el-fr/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fr/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.el.fr | 63.0 | 0.741 |

models/el-sv/README.md
# opus-2020-01-08.zip
* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-sv/opus-2020-01-08.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| GlobalVoices.el.sv | 23.6 | 0.498 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/en+de+nl+fy+af+da+fo+is+no+nb+nn+sv-en+de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+de+nl+fy+af+da+fo+is+no+nb+nn+sv-en+de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+de+nl+fy+af+da+fo+is+no+nb+nn+sv-en+de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 19.9 | 0.505 |
| newssyscomb2009.en.de | 19.9 | 0.505 |
| news-test2008.de.en | 20.1 | 0.494 |
| news-test2008.en.de | 20.1 | 0.494 |
| newstest2009.de.en | 19.1 | 0.496 |
| newstest2009.en.de | 19.1 | 0.496 |
| newstest2010.de.en | 21.0 | 0.506 |
| newstest2010.en.de | 21.0 | 0.506 |
| newstest2011.de.en | 19.3 | 0.486 |
| newstest2011.en.de | 19.3 | 0.486 |
| newstest2012.de.en | 19.6 | 0.487 |
| newstest2012.en.de | 19.6 | 0.487 |
| newstest2013.de.en | 23.0 | 0.512 |
| newstest2013.en.de | 23.0 | 0.512 |
| newstest2014-deen.de.en | 26.3 | 0.535 |
| newstest2015-ende.de.en | 26.6 | 0.540 |
| newstest2015-ende.en.de | 26.6 | 0.540 |
| newstest2016-ende.de.en | 29.1 | 0.569 |
| newstest2016-ende.en.de | 29.1 | 0.569 |
| newstest2017-ende.de.en | 24.5 | 0.528 |
| newstest2017-ende.en.de | 24.5 | 0.528 |
| newstest2018-ende.de.en | 34.9 | 0.604 |
| newstest2018-ende.en.de | 34.9 | 0.604 |
| newstest2019-deen.de.en | 31.3 | 0.569 |
| newstest2019-ende.en.de | 32.2 | 0.578 |
| Tatoeba.en.sv | 43.3 | 0.630 |

# opus-2019-12-04.zip
* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/en+fr-da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+fr-da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+fr-da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)
## Benchmarks
| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.en.sv | 53.0 | 0.685 |
