mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git
synced 2024-10-05 16:47:21 +03:00

initial import
This commit is contained in:
commit b36d9a3e22

396 LICENSE Normal file

@@ -0,0 +1,396 @@
Attribution 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.

     Considerations for licensors: Our public licenses are
     intended for use by those authorized to give the public
     permission to use material in ways otherwise restricted by
     copyright and certain other rights. Our licenses are
     irrevocable. Licensors should read and understand the terms
     and conditions of the license they choose before applying it.
     Licensors should also secure all rights necessary before
     applying our licenses so that the public can reuse the
     material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
     wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public
     licenses, a licensor grants the public permission to use the
     licensed material under specified terms and conditions. If
     the licensor's permission is not necessary for any reason--for
     example, because of any applicable exception or limitation to
     copyright--then that use is not regulated by the license. Our
     licenses grant only permissions under copyright and certain
     other rights that a licensor has authority to grant. Use of
     the licensed material may still be restricted for other
     reasons, including because others have copyright or other
     rights in the material. A licensor may make special requests,
     such as asking that all changes be marked or described.
     Although not required by our licenses, you are encouraged to
     respect those requests where reasonable. More considerations
     for the public:
     wiki.creativecommons.org/Considerations_for_licensees

=======================================================================

Creative Commons Attribution 4.0 International Public License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution 4.0 International Public License ("Public License"). To the
extent this Public License may be interpreted as a contract, You are
granted the Licensed Rights in consideration of Your acceptance of
these terms and conditions, and the Licensor grants You such rights in
consideration of benefits the Licensor receives from making the
Licensed Material available under these terms and conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  d. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  e. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  f. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  g. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  h. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  i. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  j. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  k. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part; and

            b. produce, reproduce, and Share Adapted Material.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

       4. If You Share Adapted Material You produce, the Adapter's
          License You apply must not prevent recipients of the Adapted
          Material from complying with this Public License.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material; and

  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.


=======================================================================

Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the "Licensor." The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.

Creative Commons may be contacted at creativecommons.org.
564 Makefile Normal file

@@ -0,0 +1,564 @@
# -*-makefile-*-
#
# train Opus-MT models using MarianNMT
#
#--------------------------------------------------------------------
#
# (1) train NMT model
#
# make train .............. train NMT model for current language pair
#
# (2) translate and evaluate
#
# make translate .......... translate test set
# make eval ............... evaluate
#
#--------------------------------------------------------------------
#
# general parameters / variables (see Makefile.config)
#   SRCLANGS ............ set source language(s) (en)
#   TRGLANGS ............ set target language(s) (de)
#
#
# submit jobs by adding a suffix to the make target to be run
#   .submit ........ job on GPU nodes (for train and translate)
#   .submitcpu ..... job on CPU nodes (for translate and eval)
#
# for example:
#   make train.submit
#
# run a multi-GPU job, for example:
#   make train-multigpu.submit
#   make train-twogpu.submit
#   make train-gpu01.submit
#   make train-gpu23.submit
#
#
# typical procedure: train and evaluate en-de with 3 models in ensemble
#
#   make data.submitcpu
#   make vocab.submit
#   make NR=1 train.submit
#   make NR=2 train.submit
#   make NR=3 train.submit
#
#   make NR=1 eval.submit
#   make NR=2 eval.submit
#   make NR=3 eval.submit
#   make eval-ensemble.submit
#
#
# include right-to-left models:
#
#   make NR=1 train-RL.submit
#   make NR=2 train-RL.submit
#   make NR=3 train-RL.submit
#
#
#--------------------------------------------------------------------
# train several versions of the same model (for ensembling)
#
#   make NR=1 ....
#   make NR=2 ....
#   make NR=3 ....
#
# DANGER: problem with vocabulary files if you start them simultaneously
#         --> race condition between the processes when creating them
#
#--------------------------------------------------------------------
# resume training
#
#   make resume
#
#--------------------------------------------------------------------
# translate with ensembles of models
#
#   make translate-ensemble
#   make eval-ensemble
#
# this only makes sense if there are several models
# (created with different NR)
#--------------------------------------------------------------------

# check and adjust Makefile.env and Makefile.config
# add specific tasks in Makefile.tasks

include Makefile.env
include Makefile.config
include Makefile.dist
include Makefile.tasks
include Makefile.data
include Makefile.doclevel
include Makefile.slurm

#------------------------------------------------------------------------
# make various data sets
#------------------------------------------------------------------------

.PHONY: data
data: ${TRAIN_SRC}.clean.${PRE_SRC}.gz ${TRAIN_TRG}.clean.${PRE_TRG}.gz \
	${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
	${MAKE} ${TEST_SRC}.${PRE_SRC} ${TEST_TRG}
	${MAKE} ${TRAIN_ALG}
	${MAKE} ${MODEL_VOCAB}

traindata: ${TRAIN_SRC}.clean.${PRE_SRC}.gz ${TRAIN_TRG}.clean.${PRE_TRG}.gz
tunedata: ${TUNE_SRC}.${PRE_SRC} ${TUNE_TRG}.${PRE_TRG}
devdata: ${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
testdata: ${TEST_SRC}.${PRE_SRC} ${TEST_TRG}
wordalign: ${TRAIN_ALG}

devdata-raw: ${DEV_SRC} ${DEV_TRG}

#------------------------------------------------------------------------
# train, translate and evaluate
#------------------------------------------------------------------------

## other model types
vocab: ${MODEL_VOCAB}
train: ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.done
translate: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}
eval: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}.eval
compare: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}.compare

## ensemble of models (assumes they can be found in subdirs of the WORKDIR)
translate-ensemble: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.ensemble.${SRC}.${TRG}
eval-ensemble: ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.ensemble.${SRC}.${TRG}.eval

## resume training on an existing model
resume:
	if [ -e ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.npz.best-perplexity.npz ]; then \
	  cp ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.npz.best-perplexity.npz \
	     ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.npz; \
	fi
	sleep 1
	rm -f ${WORKDIR}/${MODEL}.${MODELTYPE}.model${NR}.done
	${MAKE} train

#------------------------------------------------------------------------
# translate and evaluate all test sets in testsets/
#------------------------------------------------------------------------

## testset dir for all test sets in this language pair
## and all tokenized test sets that can be found in that directory
TESTSET_HOME = ${PWD}/testsets
TESTSET_DIR = ${TESTSET_HOME}/${SRC}-${TRG}
# TESTSETS = $(patsubst ${TESTSET_DIR}/%.${SRC}.tok.gz,%,${wildcard ${TESTSET_DIR}/*.${SRC}.tok.gz})
TESTSETS = $(patsubst ${TESTSET_DIR}/%.${SRC}.gz,%,${wildcard ${TESTSET_DIR}/*.${SRC}.gz})
TESTSETS_PRESRC = $(patsubst %.gz,%.${PRE}.gz,${sort $(subst .${PRE},,${wildcard ${TESTSET_DIR}/*.${SRC}.gz})})
TESTSETS_PRETRG = $(patsubst %.gz,%.${PRE}.gz,${sort $(subst .${PRE},,${wildcard ${TESTSET_DIR}/*.${TRG}.gz})})

## eval all available test sets
eval-testsets:
	for s in ${SRCLANGS}; do \
	  for t in ${TRGLANGS}; do \
	    ${MAKE} SRC=$$s TRG=$$t compare-testsets-langpair; \
	  done \
	done
|
||||
|
||||
eval-heldout:
|
||||
${MAKE} TESTSET_HOME=${HELDOUT_DIR} eval-testsets
|
||||
|
||||
%-testsets-langpair: ${TESTSETS_PRESRC} ${TESTSETS_PRETRG}
|
||||
@echo "testsets: ${TESTSET_DIR}/*.${SRC}.gz"
|
||||
for t in ${TESTSETS}; do \
|
||||
${MAKE} TESTSET=$$t ${@:-testsets-langpair=}; \
|
||||
done
|
||||
|
||||
|
||||
|
||||
#------------------------------------------------------------------------
|
||||
# some helper functions
|
||||
#------------------------------------------------------------------------
|
||||
|
||||
|
||||
## check whether a model is converged or not
|
||||
finished:
|
||||
@if grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKDIR}/${MODEL_VALIDLOG}; then\
|
||||
echo "${WORKDIR}/${MODEL_BASENAME} finished"; \
|
||||
else \
|
||||
echo "${WORKDIR}/${MODEL_BASENAME} unfinished"; \
|
||||
fi

## extension -all: run something over all language pairs, e.g.
##   make wordalign-all
## this goes sequentially over all language pairs
## for the parallelizable version of this: look at %-all-parallel
%-all:
	for l in ${ALL_LANG_PAIRS}; do \
	  ${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
	          TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" ${@:-all=}; \
	done
|
||||
|
||||
# run something over all language pairs that have trained models
|
||||
## - make eval-allmodels
|
||||
## - make dist-allmodels
|
||||
%-allmodels:
|
||||
for l in ${ALL_LANG_PAIRS}; do \
|
||||
if [ `find ${WORKHOME}/$$l -name '*.${PRE_SRC}-${PRE_TRG}.*.npz' | wc -l` -gt 0 ]; then \
|
||||
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
|
||||
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" ${@:-allmodels=}; \
|
||||
fi \
|
||||
done
|
||||
|
||||
## only bilingual models
|
||||
%-allbilingual:
|
||||
for l in ${ALL_BILINGUAL_MODELS}; do \
|
||||
if [ `find ${WORKHOME}/$$l -name '*.${PRE_SRC}-${PRE_TRG}.*.npz' | wc -l` -gt 0 ]; then \
|
||||
${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
|
||||
TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" ${@:-allbilingual=}; \
|
||||
fi \
|
||||
done
|
||||
|
||||
|
||||
## run something over all language pairs but make it possible to do it in parallel, for example
|
||||
## - make dist-all-parallel
|
||||
%-all-parallel:
|
||||
${MAKE} $(subst -all-parallel,,${patsubst %,$@__%-run-for-langpair,${ALL_LANG_PAIRS}})
|
||||
|
||||
## run a command that includes the langpair, for example
|
||||
## make wordalign__en-da+sv-run-for-langpair ...... runs wordalign with SRCLANGS="en" TRGLANGS="da sv"
|
||||
## What is this good for?
|
||||
## ---> can run many lang-pairs in parallel instead of having a for loop and run sequencetially
|
||||
%-run-for-langpair:
|
||||
${MAKE} SRCLANGS='$(subst +, ,$(firstword $(subst -, ,${lastword ${subst __, ,${@:-run-for-langpair=}}})))' \
|
||||
TRGLANGS='$(subst +, ,$(lastword $(subst -, ,${lastword ${subst __, ,${@:-run-for-langpair=}}})))' \
|
||||
${shell echo $@ | sed 's/__.*$$//'}

## right-to-left model
%-RL:
	${MAKE} MODEL=${MODEL}-RL \
	        MARIAN_EXTRA="${MARIAN_EXTRA} --right-left" \
	        ${@:-RL=}


## run a multigpu job (2 or 4 GPUs)

%-multigpu %-gpu0123:
	${MAKE} NR_GPUS=4 MARIAN_GPUS='0 1 2 3' $(subst -gpu0123,,${@:-multigpu=})

%-twogpu %-gpu01:
	${MAKE} NR_GPUS=2 MARIAN_GPUS='0 1' $(subst -gpu01,,${@:-twogpu=})

%-gpu23:
	${MAKE} NR_GPUS=2 MARIAN_GPUS='2 3' ${@:-gpu23=}
|
||||
## run on CPUs (translate-cpu, eval-cpu, translate-ensemble-cpu, ...)
|
||||
%-cpu:
|
||||
${MAKE} MARIAN=${MARIANCPU} \
|
||||
LOADMODS='${LOADCPU}' \
|
||||
MARIAN_DECODER_FLAGS="${MARIAN_DECODER_CPU}" \
|
||||
${@:-cpu=}
|
||||
|
||||
|
||||
## document level models
|
||||
%-doc:
|
||||
${MAKE} WORKHOME=${shell realpath ${PWD}/work-spm} \
|
||||
PRE=norm \
|
||||
PRE_SRC=spm${SRCBPESIZE:000=}k.doc${CONTEXT_SIZE} \
|
||||
PRE_TRG=spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE} \
|
||||
${@:-doc=}
|
||||
|
||||
|
||||
## sentence-piece models
|
||||
%-spm:
|
||||
${MAKE} WORKHOME=${shell realpath ${PWD}/work-spm} \
|
||||
PRE=norm \
|
||||
PRE_SRC=spm${SRCBPESIZE:000=}k \
|
||||
PRE_TRG=spm${TRGBPESIZE:000=}k \
|
||||
${@:-spm=}
|
||||
|
||||
%-spm-noalign:
|
||||
${MAKE} WORKHOME=${shell realpath ${PWD}/work-spm-noalign} \
|
||||
MODELTYPE=transformer \
|
||||
PRE=norm \
|
||||
PRE_SRC=spm${SRCBPESIZE:000=}k \
|
||||
PRE_TRG=spm${TRGBPESIZE:000=}k \
|
||||
${@:-spm-noalign=}
|
||||
|
||||
|
||||
|
||||
## BPE models
|
||||
%-bpe:
|
||||
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe} \
|
||||
PRE=tok \
|
||||
MODELTYPE=transformer \
|
||||
PRE_SRC=bpe${SRCBPESIZE:000=}k \
|
||||
PRE_TRG=bpe${TRGBPESIZE:000=}k \
|
||||
${@:-bpe=}
|
||||
|
||||
%-bpe-align:
|
||||
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe-align} \
|
||||
PRE=tok \
|
||||
PRE_SRC=bpe${SRCBPESIZE:000=}k \
|
||||
PRE_TRG=bpe${TRGBPESIZE:000=}k \
|
||||
${@:-bpe-align=}
|
||||
|
||||
%-bpe-memad:
|
||||
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe-memad} \
|
||||
PRE=tok \
|
||||
MODELTYPE=transformer \
|
||||
PRE_SRC=bpe${SRCBPESIZE:000=}k \
|
||||
PRE_TRG=bpe${TRGBPESIZE:000=}k \
|
||||
${@:-bpe-memad=}
|
||||
|
||||
%-bpe-old:
|
||||
${MAKE} WORKHOME=${shell realpath ${PWD}/work-bpe-old} \
|
||||
PRE=tok \
|
||||
MODELTYPE=transformer \
|
||||
PRE_SRC=bpe${SRCBPESIZE:000=}k \
|
||||
PRE_TRG=bpe${TRGBPESIZE:000=}k \
|
||||
${@:-bpe-old=}
|
||||
|
||||
|
||||
## for the inbuilt sentence-piece segmentation:
|
||||
# PRE_SRC=txt PRE_TRG=txt
|
||||
# MARIAN=${MARIAN}-spm
|
||||
# MODEL_VOCABTYPE=spm
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
## continue document-level training with a new context size

ifndef NEW_CONTEXT
  NEW_CONTEXT = $$(($(CONTEXT_SIZE) + $(CONTEXT_SIZE)))
endif
|
||||
|
||||
continue-doctrain:
|
||||
mkdir -p ${WORKDIR}/${MODEL}
|
||||
cp ${MODEL_VOCAB} ${WORKDIR}/${MODEL}/$(subst .doc${CONTEXT_SIZE},.doc${NEW_CONTEXT},${notdir ${MODEL_VOCAB}})
|
||||
cp ${MODEL_FINAL} ${WORKDIR}/${MODEL}/$(subst .doc${CONTEXT_SIZE},.doc${NEW_CONTEXT},$(notdir ${MODEL_BASENAME})).npz
|
||||
${MAKE} MODEL_SUBDIR=${MODEL}/ CONTEXT_SIZE=$(NEW_CONTEXT) train-doc
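When NEW_CONTEXT is not given, it defaults to twice the current context size: the escaped `$$((...))` expands to plain shell arithmetic in the recipe. A minimal sketch of that computation, assuming the configured default CONTEXT_SIZE of 100:

```shell
# Default NEW_CONTEXT: shell arithmetic doubling the current context size.
CONTEXT_SIZE=100
NEW_CONTEXT=$((CONTEXT_SIZE + CONTEXT_SIZE))
echo "$NEW_CONTEXT"   # prints 200
```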


## continue training with a new dataset

ifndef NEW_DATASET
NEW_DATASET = OpenSubtitles
endif

continue-datatrain:
	mkdir -p ${WORKDIR}/${MODEL}
	cp ${MODEL_VOCAB} ${WORKDIR}/${MODEL}/$(patsubst ${DATASET}%,${NEW_DATASET}%,${notdir ${MODEL_VOCAB}})
	cp ${MODEL_FINAL} ${WORKDIR}/${MODEL}/$(patsubst ${DATASET}%,${NEW_DATASET}%,${MODEL_BASENAME}).npz
	if [ -e ${BPESRCMODEL} ]; then \
	  cp ${BPESRCMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${BPESRCMODEL}); \
	  cp ${BPETRGMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${BPETRGMODEL}); \
	fi
	if [ -e ${SPMSRCMODEL} ]; then \
	  cp ${SPMSRCMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${SPMSRCMODEL}); \
	  cp ${SPMTRGMODEL} $(patsubst ${WORKDIR}/train/${DATASET}%,${WORKDIR}/train/${NEW_DATASET}%,${SPMTRGMODEL}); \
	fi
	${MAKE} MODEL_SUBDIR=${MODEL}/ DATASET=$(NEW_DATASET) train


# MARIAN_EXTRA="${MARIAN_EXTRA} --no-restore-corpus"


#------------------------------------------------------------------------
# training MarianNMT models
#------------------------------------------------------------------------


## make vocabulary
## - no new vocabulary is created if the file already exists!
## - delete the file if you want to create a new one!

${MODEL_VOCAB}: ${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
		${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz
ifeq ($(wildcard ${MODEL_VOCAB}),)
	mkdir -p ${dir $@}
	${LOADMODS} && zcat $^ | ${MARIAN}/marian-vocab --max-size ${VOCABSIZE} > $@
else
	@echo "$@ already exists!"
	@echo "WARNING! No new vocabulary is created even though the data has changed!"
	@echo "WARNING! Delete the file if you want to start from scratch!"
	touch $@
endif


## NEW: take away dependency on ${MODEL_VOCAB}
## (will be created by marian if it does not exist)

## train transformer model
${WORKDIR}/${MODEL}.transformer.model${NR}.done: \
		${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
		${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz \
		${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
	mkdir -p ${dir $@}
	${LOADMODS} && ${MARIAN}/marian ${MARIAN_EXTRA} \
		--model $(@:.done=.npz) \
		--type transformer \
		--train-sets ${word 1,$^} ${word 2,$^} ${MARIAN_TRAIN_WEIGHTS} \
		--max-length 500 \
		--vocabs ${MODEL_VOCAB} ${MODEL_VOCAB} \
		--mini-batch-fit \
		-w ${MARIAN_WORKSPACE} \
		--maxi-batch ${MARIAN_MAXI_BATCH} \
		--early-stopping ${MARIAN_EARLY_STOPPING} \
		--valid-freq ${MARIAN_VALID_FREQ} \
		--save-freq ${MARIAN_SAVE_FREQ} \
		--disp-freq ${MARIAN_DISP_FREQ} \
		--valid-sets ${word 3,$^} ${word 4,$^} \
		--valid-metrics perplexity \
		--valid-mini-batch ${MARIAN_VALID_MINI_BATCH} \
		--beam-size 12 --normalize 1 \
		--log $(@:.model${NR}.done=.train${NR}.log) --valid-log $(@:.model${NR}.done=.valid${NR}.log) \
		--enc-depth 6 --dec-depth 6 \
		--transformer-heads 8 \
		--transformer-postprocess-emb d \
		--transformer-postprocess dan \
		--transformer-dropout ${MARIAN_DROPOUT} \
		--label-smoothing 0.1 \
		--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
		--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
		--tied-embeddings-all \
		--overwrite --keep-best \
		--devices ${MARIAN_GPUS} \
		--sync-sgd --seed ${SEED} \
		--sqlite \
		--tempdir ${TMPDIR} \
		--exponential-smoothing
	touch $@


## NEW: take away dependency on ${MODEL_VOCAB}

## train transformer model with guided alignment
${WORKDIR}/${MODEL}.transformer-align.model${NR}.done: \
		${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
		${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz \
		${TRAIN_ALG} \
		${DEV_SRC}.${PRE_SRC} ${DEV_TRG}.${PRE_TRG}
	mkdir -p ${dir $@}
	${LOADMODS} && ${MARIAN}/marian ${MARIAN_EXTRA} \
		--model $(@:.done=.npz) \
		--type transformer \
		--train-sets ${word 1,$^} ${word 2,$^} ${MARIAN_TRAIN_WEIGHTS} \
		--max-length 500 \
		--vocabs ${MODEL_VOCAB} ${MODEL_VOCAB} \
		--mini-batch-fit \
		-w ${MARIAN_WORKSPACE} \
		--maxi-batch ${MARIAN_MAXI_BATCH} \
		--early-stopping ${MARIAN_EARLY_STOPPING} \
		--valid-freq ${MARIAN_VALID_FREQ} \
		--save-freq ${MARIAN_SAVE_FREQ} \
		--disp-freq ${MARIAN_DISP_FREQ} \
		--valid-sets ${word 4,$^} ${word 5,$^} \
		--valid-metrics perplexity \
		--valid-mini-batch ${MARIAN_VALID_MINI_BATCH} \
		--beam-size 12 --normalize 1 \
		--log $(@:.model${NR}.done=.train${NR}.log) --valid-log $(@:.model${NR}.done=.valid${NR}.log) \
		--enc-depth 6 --dec-depth 6 \
		--transformer-heads 8 \
		--transformer-postprocess-emb d \
		--transformer-postprocess dan \
		--transformer-dropout ${MARIAN_DROPOUT} \
		--label-smoothing 0.1 \
		--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
		--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
		--tied-embeddings-all \
		--overwrite --keep-best \
		--devices ${MARIAN_GPUS} \
		--sync-sgd --seed ${SEED} \
		--sqlite \
		--tempdir ${TMPDIR} \
		--exponential-smoothing \
		--guided-alignment ${word 3,$^}
	touch $@


#------------------------------------------------------------------------
# translate with an ensemble of several models
#------------------------------------------------------------------------

ENSEMBLE = ${wildcard ${WORKDIR}/${MODEL}.${MODELTYPE}.model*.npz.best-perplexity.npz}

${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.ensemble.${SRC}.${TRG}: ${TEST_SRC}.${PRE_SRC} ${ENSEMBLE}
	mkdir -p ${dir $@}
	grep . $< > $@.input
	${LOADMODS} && ${MARIAN}/marian-decoder -i $@.input \
		--models ${ENSEMBLE} \
		--vocabs ${WORKDIR}/${MODEL}.vocab.yml \
			${WORKDIR}/${MODEL}.vocab.yml \
			${WORKDIR}/${MODEL}.vocab.yml \
		${MARIAN_DECODER_FLAGS} > $@.output
ifeq (${PRE_TRG},spm${TRGBPESIZE:000=}k)
	sed 's/ //g;s/▁/ /g' < $@.output | sed 's/^ *//;s/ *$$//' > $@
else
	sed 's/\@\@ //g;s/ \@\@//g;s/ \@\-\@ /-/g' < $@.output |\
	$(TOKENIZER)/detokenizer.perl -l ${TRG} > $@
endif
	rm -f $@.input $@.output


#------------------------------------------------------------------------
# translate, evaluate and generate a file
# for comparing system and reference translations
#------------------------------------------------------------------------

${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}: ${TEST_SRC}.${PRE_SRC} ${MODEL_FINAL}
	mkdir -p ${dir $@}
	grep . $< > $@.input
	${LOADMODS} && ${MARIAN}/marian-decoder -i $@.input \
		-c ${word 2,$^}.decoder.yml \
		-d ${MARIAN_GPUS} \
		${MARIAN_DECODER_FLAGS} > $@.output
ifeq (${PRE_TRG},spm${TRGBPESIZE:000=}k)
	sed 's/ //g;s/▁/ /g' < $@.output | sed 's/^ *//;s/ *$$//' > $@
else
	sed 's/\@\@ //g;s/ \@\@//g;s/ \@\-\@ /-/g' < $@.output |\
	$(TOKENIZER)/detokenizer.perl -l ${TRG} > $@
endif
	rm -f $@.input $@.output
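The two post-processing branches above can be checked in isolation. A small sketch with made-up tokens: the first command reverses SentencePiece segmentation (drop spaces, turn the `▁` marker back into a space, trim), the second joins `@@`-marked BPE subwords and restores `@-@` hyphens; note the recipe doubles the `$` for make, while plain shell uses a single `$`:

```shell
# SentencePiece detokenization: remove spaces, map "▁" back to space, trim.
echo '▁Hel lo ▁world' | sed 's/ //g;s/▁/ /g' | sed 's/^ *//;s/ *$//'
# BPE detokenization: join "@@"-marked subwords and restore hyphens.
echo 'un@@ believ@@ able @-@ ish' | sed 's/@@ //g;s/ @@//g;s/ @-@ /-/g'
```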


# %.eval: % ${TEST_TRG}
# 	grep . ${TEST_TRG} > $@.ref
# 	grep . $< > $@.sys
# 	cat $@.sys | sacrebleu $@.ref > $@
# 	cat $@.sys | sacrebleu --metrics=chrf --width=3 $@.ref >> $@
# 	rm -f $@.ref $@.sys

%.eval: % ${TEST_TRG}
	paste ${TEST_SRC}.${PRE_SRC} ${TEST_TRG} | grep $$'.\t' | cut -f2 > $@.ref
	cat $< | sacrebleu $@.ref > $@
	cat $< | sacrebleu --metrics=chrf --width=3 $@.ref >> $@
	rm -f $@.ref
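The `grep $$'.\t'` in the rule above keeps only lines where some character precedes the tab, i.e. test pairs with a non-empty source side. A tiny sketch with invented data, using a portable `printf`-built tab instead of the bash-only `$'...'` quoting:

```shell
# Lines with an empty first field (nothing before the tab) are dropped;
# cut -f2 then extracts the reference side of the surviving pairs.
TAB=$(printf '\t')
printf 'src1\tref1\n\tref2\nsrc3\tref3\n' | grep ".${TAB}" | cut -f2
```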


%.compare: %.eval
	paste -d "\n" ${TEST_SRC} ${TEST_TRG} ${<:.eval=} |\
	sed -e "s/&apos;/'/g" \
	    -e 's/&quot;/"/g' \
	    -e 's/&lt;/</g' \
	    -e 's/&gt;/>/g' \
	    -e 's/&amp;/\&/g' |\
	sed 'n;n;G;' > $@
269
Makefile.config
Normal file
@ -0,0 +1,269 @@

# -*-makefile-*-
#
# model configurations
#

# SRCLANGS = da no sv
# TRGLANGS = fi

SRCLANGS = sv
TRGLANGS = fi

ifndef SRC
SRC := ${firstword ${SRCLANGS}}
endif
ifndef TRG
TRG := ${lastword ${TRGLANGS}}
endif

# sorted languages and langpair used to match resources in OPUS
SORTLANGS = $(sort ${SRC} ${TRG})
SPACE = $(empty) $(empty)
LANGPAIR = ${firstword ${SORTLANGS}}-${lastword ${SORTLANGS}}
LANGSTR = ${subst ${SPACE},+,$(SRCLANGS)}-${subst ${SPACE},+,$(TRGLANGS)}
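LANGSTR joins each space-separated language list with `+` and the two sides with `-`; with the commented multilingual example above (`da no sv` into `fi`) the work directory would be `da+no+sv-fi`. A rough shell equivalent of the two `subst` calls:

```shell
# Shell sketch of ${subst ${SPACE},+,...}: replace spaces with "+",
# then join source and target sides with "-".
SRCLANGS="da no sv"
TRGLANGS="fi"
echo "$(echo "$SRCLANGS" | tr ' ' '+')-$(echo "$TRGLANGS" | tr ' ' '+')"
```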


## for same language pairs: add a numeric extension
ifeq (${SRC},$(TRG))
SRCEXT = ${SRC}1
TRGEXT = ${SRC}2
else
SRCEXT = ${SRC}
TRGEXT = ${TRG}
endif


## all of OPUS (NEW: don't require MOSES format)
# OPUSCORPORA = ${patsubst %/latest/moses/${LANGPAIR}.txt.zip,%,\
# 		${patsubst ${OPUSHOME}/%,%,\
# 		${shell ls ${OPUSHOME}/*/latest/moses/${LANGPAIR}.txt.zip}}}
OPUSCORPORA = ${patsubst %/latest/xml/${LANGPAIR}.xml.gz,%,\
		${patsubst ${OPUSHOME}/%,%,\
		${shell ls ${OPUSHOME}/*/latest/xml/${LANGPAIR}.xml.gz}}}

ALL_LANG_PAIRS = ${shell ls ${WORKHOME} | grep -- '-' | grep -v old}
ALL_BILINGUAL_MODELS = ${shell ls ${WORKHOME} | grep -- '-' | grep -v old | grep -v -- '\+'}
ALL_MULTILINGUAL_MODELS = ${shell ls ${WORKHOME} | grep -- '-' | grep -v old | grep -- '\+'}


## size of dev data, test data and BPE merge operations

DEVSIZE = 5000
TESTSIZE = 5000

## NEW: significantly reduce DEVMINSIZE
## (= the absolute minimum we need as dev data)
## NEW: define an alternative small size for DEV and TEST
## OLD DEVMINSIZE:
# DEVMINSIZE = 1000

DEVSMALLSIZE = 1000
TESTSMALLSIZE = 1000
DEVMINSIZE = 250

## size of heldout data for each sub-corpus
## (only kept if the corpus has at least twice that many examples)
HELDOUTSIZE = ${DEVSIZE}


##----------------------------------------------------------------------------
## train/dev/test data
##----------------------------------------------------------------------------

## dev/test data: default = Tatoeba; otherwise GlobalVoices, JW300, GNOME or bible-uedin
## - check that the data exist
## - check that there are at least 2 x DEVMINSIZE examples

ifneq ($(wildcard ${OPUSHOME}/Tatoeba/latest/moses/${LANGPAIR}.txt.zip),)
ifeq ($(shell if (( `head -1 ${OPUSHOME}/Tatoeba/latest/info/${LANGPAIR}.txt.info` \
		> $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then echo "ok"; fi),ok)
DEVSET = Tatoeba
endif
endif

## backoff to GlobalVoices
ifndef DEVSET
ifneq ($(wildcard ${OPUSHOME}/GlobalVoices/latest/moses/${LANGPAIR}.txt.zip),)
ifeq ($(shell if (( `head -1 ${OPUSHOME}/GlobalVoices/latest/info/${LANGPAIR}.txt.info` \
		> $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then echo "ok"; fi),ok)
DEVSET = GlobalVoices
endif
endif
endif

## backoff to JW300
ifndef DEVSET
ifneq ($(wildcard ${OPUSHOME}/JW300/latest/xml/${LANGPAIR}.xml.gz),)
ifeq ($(shell if (( `sed -n 2p ${OPUSHOME}/JW300/latest/info/${LANGPAIR}.info` \
		> $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then echo "ok"; fi),ok)
DEVSET = JW300
endif
endif
endif

## otherwise: bible-uedin
ifndef DEVSET
DEVSET = bible-uedin
endif


## in case we want to use some additional data sets
EXTRA_TRAINSET =

## TESTSET = DEVSET; TRAINSET = all OPUS corpora except WMT-News, DEVSET and TESTSET
TESTSET = ${DEVSET}
TRAINSET = $(filter-out WMT-News ${DEVSET} ${TESTSET},${OPUSCORPORA} ${EXTRA_TRAINSET})
TUNESET = OpenSubtitles

## 1 = use remaining data from dev/test data for training
USE_REST_DEVDATA = 1


##----------------------------------------------------------------------------
## pre-processing and vocabulary
##----------------------------------------------------------------------------

BPESIZE = 32000
SRCBPESIZE = ${BPESIZE}
TRGBPESIZE = ${BPESIZE}

ifndef VOCABSIZE
VOCABSIZE = $$((${SRCBPESIZE} + ${TRGBPESIZE} + 1000))
endif
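The default vocabulary size is again shell arithmetic evaluated inside recipes: the two merge sizes plus a margin of 1000. With the 32k defaults:

```shell
# Default VOCABSIZE from the source and target BPE merge sizes.
SRCBPESIZE=32000
TRGBPESIZE=32000
echo $((SRCBPESIZE + TRGBPESIZE + 1000))   # prints 65000
```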

## for document-level models
CONTEXT_SIZE = 100

## pre-processing type
PRE = norm
PRE_SRC = spm${SRCBPESIZE:000=}k
PRE_TRG = spm${TRGBPESIZE:000=}k


##-------------------------------------
## name of the data set (and the model)
## - single corpus = use that name
## - multiple corpora = opus
## the vocab size is also added to the name
##-------------------------------------

ifndef DATASET
ifeq (${words ${TRAINSET}},1)
DATASET = ${TRAINSET}
else
DATASET = opus
endif
endif


## DATADIR = directory where the train/dev/test data are
## WORKDIR = directory used for training

DATADIR = ${WORKHOME}/data
WORKDIR = ${WORKHOME}/${LANGSTR}

## data sets
TRAIN_BASE = ${WORKDIR}/train/${DATASET}
TRAIN_SRC = ${TRAIN_BASE}.src
TRAIN_TRG = ${TRAIN_BASE}.trg
TRAIN_ALG = ${TRAIN_BASE}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}.src-trg.alg.gz

## training data in local space
LOCAL_TRAIN_SRC = ${TMPDIR}/${LANGSTR}/train/${DATASET}.src
LOCAL_TRAIN_TRG = ${TMPDIR}/${LANGSTR}/train/${DATASET}.trg

TUNE_SRC = ${WORKDIR}/tune/${TUNESET}.src
TUNE_TRG = ${WORKDIR}/tune/${TUNESET}.trg

DEV_SRC = ${WORKDIR}/val/${DEVSET}.src
DEV_TRG = ${WORKDIR}/val/${DEVSET}.trg

TEST_SRC = ${WORKDIR}/test/${TESTSET}.src
TEST_TRG = ${WORKDIR}/test/${TESTSET}.trg

## heldout data directory (keep one set per data set)
HELDOUT_DIR = ${WORKDIR}/heldout

MODEL_SUBDIR =
MODEL = ${MODEL_SUBDIR}${DATASET}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}
MODELTYPE = transformer-align
NR = 1

MODEL_BASENAME = ${MODEL}.${MODELTYPE}.model${NR}
MODEL_VALIDLOG = ${MODEL}.${MODELTYPE}.valid${NR}.log
MODEL_TRAINLOG = ${MODEL}.${MODELTYPE}.train${NR}.log
MODEL_FINAL = ${WORKDIR}/${MODEL_BASENAME}.npz.best-perplexity.npz
MODEL_VOCABTYPE = yml
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab.${MODEL_VOCABTYPE}
MODEL_DECODER = ${MODEL_FINAL}.decoder.yml

## test set translation and scores
TEST_TRANSLATION = ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}
TEST_EVALUATION = ${TEST_TRANSLATION}.eval
TEST_COMPARISON = ${TEST_TRANSLATION}.compare


## parameters for running Marian NMT

MARIAN_GPUS = 0
MARIAN_EXTRA =
MARIAN_VALID_FREQ = 10000
MARIAN_SAVE_FREQ = ${MARIAN_VALID_FREQ}
MARIAN_DISP_FREQ = ${MARIAN_VALID_FREQ}
MARIAN_EARLY_STOPPING = 10
MARIAN_VALID_MINI_BATCH = 16
MARIAN_MAXI_BATCH = 500
MARIAN_DROPOUT = 0.1

MARIAN_DECODER_GPU = -b 12 -n1 -d ${MARIAN_GPUS} --mini-batch 8 --maxi-batch 32 --maxi-batch-sort src
MARIAN_DECODER_CPU = -b 12 -n1 --cpu-threads ${HPC_CORES} --mini-batch 8 --maxi-batch 32 --maxi-batch-sort src
MARIAN_DECODER_FLAGS = ${MARIAN_DECODER_GPU}

## TODO: currently marianNMT crashes with workspace > 26000
ifeq (${GPU},p100)
MARIAN_WORKSPACE = 13000
else ifeq (${GPU},v100)
# MARIAN_WORKSPACE = 30000
# MARIAN_WORKSPACE = 26000
# MARIAN_WORKSPACE = 24000
# MARIAN_WORKSPACE = 18000
MARIAN_WORKSPACE = 16000
else
MARIAN_WORKSPACE = 10000
endif

## fall back to CPU decoding/training if nvidia-smi reports no usable GPU
ifeq (${shell nvidia-smi | grep failed | wc -l},1)
MARIAN = ${MARIANCPU}
MARIAN_DECODER_FLAGS = ${MARIAN_DECODER_CPU}
MARIAN_EXTRA = --cpu-threads ${HPC_CORES}
endif

ifneq ("$(wildcard ${TRAIN_WEIGHTS})","")
MARIAN_TRAIN_WEIGHTS = --data-weighting ${TRAIN_WEIGHTS}
endif


### training a model with Marian NMT
##
## NR makes it possible to train several models for proper ensembling
## (with a shared vocab)
##
## DANGER: if several models are started at the same time,
## there is a race condition when creating the vocab!

ifdef NR
SEED=${NR}${NR}${NR}${NR}
else
SEED=1234
endif

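With NR set, each model gets a distinct random seed by repeating the model number four times, so ensemble members start from different initialisations while staying reproducible. A quick sketch of that expansion:

```shell
# SEED is the model number repeated four times, e.g. NR=2 -> 2222.
NR=2
SEED="${NR}${NR}${NR}${NR}"
echo "$SEED"   # prints 2222
```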
866
Makefile.data
Normal file
@ -0,0 +1,866 @@

# -*-makefile-*-

ifndef SRCLANGS
SRCLANGS=${SRC}
endif

ifndef TRGLANGS
TRGLANGS=${TRG}
endif

ifndef THREADS
THREADS=${HPC_CORES}
endif


CLEAN_TRAIN_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${TRAINSET}}
CLEAN_TRAIN_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_TRAIN_SRC}}

CLEAN_TUNE_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${TUNESET}}
CLEAN_TUNE_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_TUNE_SRC}}

CLEAN_DEV_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${DEVSET}}
CLEAN_DEV_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_DEV_SRC}}

CLEAN_TEST_SRC = ${patsubst %,${DATADIR}/${PRE}/%.${LANGPAIR}.clean.${SRCEXT}.gz,${TESTSET}}
CLEAN_TEST_TRG = ${patsubst %.${SRCEXT}.gz,%.${TRGEXT}.gz,${CLEAN_TEST_SRC}}

DATA_SRC := ${sort ${CLEAN_TRAIN_SRC} ${CLEAN_TUNE_SRC} ${CLEAN_DEV_SRC} ${CLEAN_TEST_SRC}}
DATA_TRG := ${sort ${CLEAN_TRAIN_TRG} ${CLEAN_TUNE_TRG} ${CLEAN_DEV_TRG} ${CLEAN_TEST_TRG}}


## make data in the reverse direction without re-doing word alignment etc.
## ---> this is dangerous when things run in parallel
## ---> only works for bilingual models

REV_LANGSTR = ${subst ${SPACE},+,$(TRGLANGS)}-${subst ${SPACE},+,$(SRCLANGS)}
REV_WORKDIR = ${WORKHOME}/${REV_LANGSTR}

reverse-data:
ifeq (${PRE_SRC},${PRE_TRG})
ifeq (${words ${SRCLANGS}},1)
ifeq (${words ${TRGLANGS}},1)
	-if [ -e ${TRAIN_SRC}.clean.${PRE_SRC}.gz ]; then \
	  mkdir -p ${REV_WORKDIR}/train; \
	  ln -s ${TRAIN_SRC}.clean.${PRE_SRC}.gz ${REV_WORKDIR}/train/${notdir ${TRAIN_TRG}.clean.${PRE_TRG}.gz}; \
	  ln -s ${TRAIN_TRG}.clean.${PRE_TRG}.gz ${REV_WORKDIR}/train/${notdir ${TRAIN_SRC}.clean.${PRE_SRC}.gz}; \
	fi
	-if [ -e ${SPMSRCMODEL} ]; then \
	  ln -s ${SPMSRCMODEL} ${REV_WORKDIR}/train/${notdir ${SPMTRGMODEL}}; \
	  ln -s ${SPMTRGMODEL} ${REV_WORKDIR}/train/${notdir ${SPMSRCMODEL}}; \
	fi
	-if [ -e ${BPESRCMODEL} ]; then \
	  ln -s ${BPESRCMODEL} ${REV_WORKDIR}/train/${notdir ${BPETRGMODEL}}; \
	  ln -s ${BPETRGMODEL} ${REV_WORKDIR}/train/${notdir ${BPESRCMODEL}}; \
	fi
	-if [ -e ${TRAIN_ALG} ]; then \
	  if [ ! -e ${REV_WORKDIR}/train/${notdir ${TRAIN_ALG}} ]; then \
	    zcat ${TRAIN_ALG} | ${MOSESSCRIPTS}/generic/reverse-alignment.perl |\
	    gzip -c > ${REV_WORKDIR}/train/${notdir ${TRAIN_ALG}}; \
	  fi \
	fi
	-if [ -e ${DEV_SRC}.${PRE_SRC} ]; then \
	  mkdir -p ${REV_WORKDIR}/val; \
	  ln -s ${DEV_SRC}.${PRE_SRC} ${REV_WORKDIR}/val/${notdir ${DEV_TRG}.${PRE_TRG}}; \
	  ln -s ${DEV_TRG}.${PRE_TRG} ${REV_WORKDIR}/val/${notdir ${DEV_SRC}.${PRE_SRC}}; \
	  ln -s ${DEV_SRC} ${REV_WORKDIR}/val/${notdir ${DEV_TRG}}; \
	  ln -s ${DEV_TRG} ${REV_WORKDIR}/val/${notdir ${DEV_SRC}}; \
	  ln -s ${DEV_SRC}.shuffled.gz ${REV_WORKDIR}/val/${notdir ${DEV_SRC}.shuffled.gz}; \
	fi
	-if [ -e ${TEST_SRC} ]; then \
	  mkdir -p ${REV_WORKDIR}/test; \
	  ln -s ${TEST_SRC} ${REV_WORKDIR}/test/${notdir ${TEST_TRG}}; \
	  ln -s ${TEST_TRG} ${REV_WORKDIR}/test/${notdir ${TEST_SRC}}; \
	fi
	-if [ -e ${MODEL_VOCAB} ]; then \
	  ln -s ${MODEL_VOCAB} ${REV_WORKDIR}/${notdir ${MODEL_VOCAB}}; \
	fi
endif
endif
endif


clean-data:
	for s in ${SRCLANGS}; do \
	  for t in ${TRGLANGS}; do \
	    ${MAKE} SRC=$$s TRG=$$t clean-data-source; \
	  done \
	done

clean-data-source: ${DATA_SRC} ${DATA_TRG}


## word alignment used for guided alignment

.INTERMEDIATE: ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp

${LOCAL_TRAIN_SRC}.algtmp: ${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz
	mkdir -p ${dir $@}
	gzip -cd < $< > $@

${LOCAL_TRAIN_TRG}.algtmp: ${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz
	mkdir -p ${dir $@}
	gzip -cd < $< > $@


## max number of lines in a corpus for running word alignment
## (split into chunks of at most that size before aligning)

MAX_WORDALIGN_SIZE = 5000000
# MAX_WORDALIGN_SIZE = 10000000
# MAX_WORDALIGN_SIZE = 25000000

${TRAIN_ALG}: ${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz \
		${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz
	${MAKE} ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
	if [ `head $(LOCAL_TRAIN_SRC).algtmp | wc -l` -gt 0 ]; then \
	  mkdir -p $(LOCAL_TRAIN_SRC).algtmp.d; \
	  mkdir -p $(LOCAL_TRAIN_TRG).algtmp.d; \
	  split -l ${MAX_WORDALIGN_SIZE} $(LOCAL_TRAIN_SRC).algtmp $(LOCAL_TRAIN_SRC).algtmp.d/; \
	  split -l ${MAX_WORDALIGN_SIZE} $(LOCAL_TRAIN_TRG).algtmp $(LOCAL_TRAIN_TRG).algtmp.d/; \
	  for s in `ls $(LOCAL_TRAIN_SRC).algtmp.d`; do \
	    echo "align part $$s"; \
	    ${WORDALIGN} --overwrite \
	      -s $(LOCAL_TRAIN_SRC).algtmp.d/$$s \
	      -t $(LOCAL_TRAIN_TRG).algtmp.d/$$s \
	      -f $(LOCAL_TRAIN_SRC).algtmp.d/$$s.fwd \
	      -r $(LOCAL_TRAIN_TRG).algtmp.d/$$s.rev; \
	  done; \
	  echo "merge and symmetrize"; \
	  cat $(LOCAL_TRAIN_SRC).algtmp.d/*.fwd > $(LOCAL_TRAIN_SRC).fwd; \
	  cat $(LOCAL_TRAIN_TRG).algtmp.d/*.rev > $(LOCAL_TRAIN_TRG).rev; \
	  ${ATOOLS} -c grow-diag-final -i $(LOCAL_TRAIN_SRC).fwd -j $(LOCAL_TRAIN_TRG).rev |\
	  gzip -c > $@; \
	  rm -f ${LOCAL_TRAIN_SRC}.algtmp.d/*; \
	  rm -f ${LOCAL_TRAIN_TRG}.algtmp.d/*; \
	  rmdir ${LOCAL_TRAIN_SRC}.algtmp.d; \
	  rmdir ${LOCAL_TRAIN_TRG}.algtmp.d; \
	  rm -f $(LOCAL_TRAIN_SRC).fwd $(LOCAL_TRAIN_TRG).rev; \
	fi
	rm -f ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
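The chunking in the rule above relies on `split` writing lexicographically ordered part files (`aa`, `ab`, ...), so that concatenating the per-chunk outputs restores the original corpus order. A self-contained sketch of that round trip (file names invented):

```shell
# split + cat is a no-op on line order: the shell glob sorts the part
# files the same way split named them.
tmp=$(mktemp -d)
seq 1 10 > "$tmp/corpus.txt"
split -l 4 "$tmp/corpus.txt" "$tmp/part."
cat "$tmp"/part.* > "$tmp/merged.txt"
cmp -s "$tmp/corpus.txt" "$tmp/merged.txt" && echo "order preserved"
rm -rf "$tmp"
```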


## old way of running word alignment with all the data in one process
## --> this may take a long time for very large corpora
## --> it may also take a lot of memory (split instead, see above)

# ${TRAIN_ALG}: ${TRAIN_SRC}.${PRE_SRC}${TRAINSIZE}.gz \
# 		${TRAIN_TRG}.${PRE_TRG}${TRAINSIZE}.gz
# 	${MAKE} ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
# 	if [ `head $(LOCAL_TRAIN_SRC).algtmp | wc -l` -gt 0 ]; then \
# 	  ${WORDALIGN} -s $(LOCAL_TRAIN_SRC).algtmp -t $(LOCAL_TRAIN_TRG).algtmp \
# 	    --overwrite -f $(LOCAL_TRAIN_SRC).fwd -r $(LOCAL_TRAIN_TRG).rev; \
# 	  ${ATOOLS} -c grow-diag-final -i $(LOCAL_TRAIN_SRC).fwd -j $(LOCAL_TRAIN_TRG).rev |\
# 	  gzip -c > $@; \
# 	fi
# 	rm -f ${LOCAL_TRAIN_SRC}.algtmp ${LOCAL_TRAIN_TRG}.algtmp
# 	rm -f $(LOCAL_TRAIN_SRC).fwd $(LOCAL_TRAIN_TRG).rev


## copy OPUS data
## (check that the OPUS file really exists! if not, create an empty file)
##
## TODO: should we read all data from scratch using opus_read?
## - also: langid filtering and link prob filtering?

%.${SRCEXT}.raw:
	mkdir -p ${dir $@}
	c=${patsubst %.${LANGPAIR}.${SRCEXT}.raw,%,${notdir $@}}; \
	if [ -e ${OPUSHOME}/$$c/latest/moses/${LANGPAIR}.txt.zip ]; then \
	  scp ${OPUSHOME}/$$c/latest/moses/${LANGPAIR}.txt.zip $@.zip; \
	  unzip -d ${dir $@} $@.zip -x README LICENSE; \
	  mv ${dir $@}$$c*.${LANGPAIR}.${SRCEXT} $@; \
	  mv ${dir $@}$$c*.${LANGPAIR}.${TRGEXT} \
	     ${@:.${SRCEXT}.raw=.${TRGEXT}.raw}; \
	  rm -f $@.zip ${@:.${SRCEXT}.raw=.xml} ${@:.${SRCEXT}.raw=.ids} ${dir $@}/README ${dir $@}/LICENSE; \
	elif [ -e ${OPUSHOME}/$$c/latest/xml/${LANGPAIR}.xml.gz ]; then \
	  echo "extract $$c (${LANGPAIR}) from OPUS"; \
	  opus_read -rd ${OPUSHOME} -d $$c -s ${SRC} -t ${TRG} -wm moses -p raw > $@.tmp; \
	  cut -f1 $@.tmp > $@; \
	  cut -f2 $@.tmp > ${@:.${SRCEXT}.raw=.${TRGEXT}.raw}; \
	  rm -f $@.tmp; \
	else \
	  touch $@; \
	  touch ${@:.${SRCEXT}.raw=.${TRGEXT}.raw}; \
	fi

%.${TRGEXT}.raw: %.${SRCEXT}.raw
	@echo "done!"


## clean data
## OLD: apply the cleanup script from Moses
## --> this might not be a good idea before subword splitting for languages without spaces
## NEW: do this later, after splitting into subword units
##
## TODO:
## - does this affect sentence-piece / BPE models in some negative way?
## - should we increase the length filter when cleaning later? How much?
## - should we apply some other cleanup scripts here to get rid of some messy stuff?

# ## this is too strict for non-latin languages
# # grep -i '[a-zäöå0-9]' |\

## OLD:
##
# %.clean.${SRCEXT}.gz: %.${SRCEXT}.${PRE} %.${TRGEXT}.${PRE}
# 	rm -f $@.${SRCEXT} $@.${TRGEXT}
# 	ln -s ${word 1,$^} $@.${SRCEXT}
# 	ln -s ${word 2,$^} $@.${TRGEXT}
# 	$(MOSESSCRIPTS)/training/clean-corpus-n.perl $@ $(SRCEXT) $(TRGEXT) ${@:.${SRCEXT}.gz=} 0 100
# 	rm -f $@.${SRCEXT} $@.${TRGEXT}
# 	paste ${@:.gz=} ${@:.${SRCEXT}.gz=.${TRGEXT}} |\
# 	perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' > $@.tmp
# 	rm -f ${@:.gz=} ${@:.${SRCEXT}.gz=.${TRGEXT}}
# 	cut -f1 $@.tmp | gzip -c > $@
# 	cut -f2 $@.tmp | gzip -c > ${@:.${SRCEXT}.gz=.${TRGEXT}.gz}
# 	rm -f $@.tmp

# %.clean.${TRGEXT}.gz: %.clean.${SRCEXT}.gz
# 	@echo "done!"

%.clean.${SRCEXT}.gz: %.${SRCEXT}.${PRE} %.${TRGEXT}.${PRE}
	cat $< |\
	perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
	gzip -c > $@

%.clean.${TRGEXT}.gz: %.${TRGEXT}.${PRE}
	cat $< |\
	perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
	gzip -c > $@


## add training data for each language combination
## and put it together in local space
${LOCAL_TRAIN_SRC}: ${DEV_SRC} ${DEV_TRG}
	mkdir -p ${dir $@}
	rm -f ${LOCAL_TRAIN_SRC} ${LOCAL_TRAIN_TRG}
	-for s in ${SRCLANGS}; do \
	  for t in ${TRGLANGS}; do \
	    if [ ${HELDOUTSIZE} -gt 0 ]; then \
	      ${MAKE} DATASET=${DATASET} SRC:=$$s TRG:=$$t \
	        add-to-local-train-and-heldout-data; \
	    else \
	      ${MAKE} DATASET=${DATASET} SRC:=$$s TRG:=$$t \
	        add-to-local-train-data; \
	    fi \
	  done \
	done
ifeq (${USE_REST_DEVDATA},1)
	if [ -e ${DEV_SRC}.notused.gz ]; then \
	  zcat ${DEV_SRC}.notused.gz >> ${LOCAL_TRAIN_SRC}; \
	  zcat ${DEV_TRG}.notused.gz >> ${LOCAL_TRAIN_TRG}; \
	fi
endif

# 	${MAKE} DATASET=${DATASET} SRC:=$$s TRG:=$$t add-to-local-train-data; \

${LOCAL_TRAIN_TRG}: ${LOCAL_TRAIN_SRC}
	@echo "done!"


## add to the training data
add-to-local-train-data: ${CLEAN_TRAIN_SRC} ${CLEAN_TRAIN_TRG}
ifneq (${CLEAN_TRAIN_SRC},)
	echo "${CLEAN_TRAIN_SRC}" >> ${dir ${LOCAL_TRAIN_SRC}}/README
ifneq (${words ${TRGLANGS}},1)
	echo "more than one target language";
	zcat ${CLEAN_TRAIN_SRC} |\
	sed "s/^/>>${TRG}<< /" >> ${LOCAL_TRAIN_SRC}
else
	echo "only one target language"
	zcat ${CLEAN_TRAIN_SRC} >> ${LOCAL_TRAIN_SRC}
endif
	zcat ${CLEAN_TRAIN_TRG} >> ${LOCAL_TRAIN_TRG}
endif
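For multilingual models (more than one target language) each source line is prefixed with a target-language token, following the usual multilingual NMT convention, so the model knows which language to produce. The `sed` prefixing boils down to:

```shell
# Prefix a source sentence with the target-language token (here "fi"):
TRG=fi
echo "hello world" | sed "s/^/>>${TRG}<< /"
```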


## extract training data but keep some heldout data for each dataset
add-to-local-train-and-heldout-data: ${CLEAN_TRAIN_SRC} ${CLEAN_TRAIN_TRG}
ifneq (${CLEAN_TRAIN_SRC},)
	echo "${CLEAN_TRAIN_SRC}" >> ${dir ${LOCAL_TRAIN_SRC}}/README
	mkdir -p ${HELDOUT_DIR}/${SRC}-${TRG}
ifneq (${words ${TRGLANGS}},1)
	echo "more than one target language";
	for c in ${CLEAN_TRAIN_SRC}; do \
	  if (( `zcat $$c | head -$$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) | wc -l` == $$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) )); then \
	    zcat $$c | tail -n +$$(($(HELDOUTSIZE) + 1)) |\
	    sed "s/^/>>${TRG}<< /" >> ${LOCAL_TRAIN_SRC}; \
	    zcat $$c | head -$(HELDOUTSIZE) |\
	    sed "s/^/>>${TRG}<< /" | gzip -c \
	    > ${HELDOUT_DIR}/${SRC}-${TRG}/`basename $$c`; \
	  else \
	    zcat $$c | sed "s/^/>>${TRG}<< /" >> ${LOCAL_TRAIN_SRC}; \
	  fi \
	done
else
	echo "only one target language"
	for c in ${CLEAN_TRAIN_SRC}; do \
	  if (( `zcat $$c | head -$$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) | wc -l` == $$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) )); then \
	    zcat $$c | tail -n +$$(($(HELDOUTSIZE) + 1)) >> ${LOCAL_TRAIN_SRC}; \
	    zcat $$c | head -$(HELDOUTSIZE) |\
	    gzip -c > ${HELDOUT_DIR}/${SRC}-${TRG}/`basename $$c`; \
	  else \
	    zcat $$c >> ${LOCAL_TRAIN_SRC}; \
	  fi \
	done
endif
	for c in ${CLEAN_TRAIN_TRG}; do \
	  if (( `zcat $$c | head -$$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) | wc -l` == $$(($(HELDOUTSIZE) + $(HELDOUTSIZE))) )); then \
	    zcat $$c | tail -n +$$(($(HELDOUTSIZE) + 1)) >> ${LOCAL_TRAIN_TRG}; \
	    zcat $$c | head -$(HELDOUTSIZE) |\
	    gzip -c > ${HELDOUT_DIR}/${SRC}-${TRG}/`basename $$c`; \
	  else \
	    zcat $$c >> ${LOCAL_TRAIN_TRG}; \
	  fi \
	done
endif


####################
# development data
####################

${DEV_SRC}.shuffled.gz:
	mkdir -p ${dir $@}
	rm -f ${DEV_SRC} ${DEV_TRG}
	-for s in ${SRCLANGS}; do \
	  for t in ${TRGLANGS}; do \
	    ${MAKE} SRC=$$s TRG=$$t add-to-dev-data; \
	  done \
	done
	paste ${DEV_SRC} ${DEV_TRG} | shuf | gzip -c > $@
|
||||
|
||||
|
||||
## if we have less than twice the amount of DEVMINSIZE in the data set
|
||||
## --> extract some data from the training data to be used as devdata
|
||||
|
||||
${DEV_SRC}: %: %.shuffled.gz
## if we extract test and dev data from the same data set
## ---> make sure that we do not have any overlap between the two data sets
## ---> reserve at least DEVMINSIZE lines for dev data and keep the rest for testing
ifeq (${DEVSET},${TESTSET})
	if (( `zcat $@.shuffled.gz | wc -l` < $$((${DEVSIZE} + ${TESTSIZE})) )); then \
	  if (( `zcat $@.shuffled.gz | wc -l` < $$((${DEVSMALLSIZE} + ${DEVMINSIZE})) )); then \
	    echo "devset = top ${DEVMINSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README; \
	    zcat $@.shuffled.gz | cut -f1 | head -${DEVMINSIZE} > ${DEV_SRC}; \
	    zcat $@.shuffled.gz | cut -f2 | head -${DEVMINSIZE} > ${DEV_TRG}; \
	    mkdir -p ${dir ${TEST_SRC}}; \
	    echo "testset = lines after the top ${DEVMINSIZE} of ../val/${notdir $@}.shuffled!" >> ${dir ${TEST_SRC}}/README; \
	    zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_SRC}; \
	    zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_TRG}; \
	  else \
	    echo "devset = top ${DEVSMALLSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README; \
	    zcat $@.shuffled.gz | cut -f1 | head -${DEVSMALLSIZE} > ${DEV_SRC}; \
	    zcat $@.shuffled.gz | cut -f2 | head -${DEVSMALLSIZE} > ${DEV_TRG}; \
	    mkdir -p ${dir ${TEST_SRC}}; \
	    echo "testset = lines after the top ${DEVSMALLSIZE} of ../val/${notdir $@}.shuffled!" >> ${dir ${TEST_SRC}}/README; \
	    zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVSMALLSIZE} + 1)) > ${TEST_SRC}; \
	    zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVSMALLSIZE} + 1)) > ${TEST_TRG}; \
	  fi; \
	else \
	  echo "devset = top ${DEVSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README; \
	  zcat $@.shuffled.gz | cut -f1 | head -${DEVSIZE} > ${DEV_SRC}; \
	  zcat $@.shuffled.gz | cut -f2 | head -${DEVSIZE} > ${DEV_TRG}; \
	  mkdir -p ${dir ${TEST_SRC}}; \
	  echo "testset = next ${TESTSIZE} lines (after the top ${DEVSIZE}) of ../val/${notdir $@}.shuffled!" >> ${dir ${TEST_SRC}}/README; \
	  zcat $@.shuffled.gz | cut -f1 | head -$$((${DEVSIZE} + ${TESTSIZE})) | tail -${TESTSIZE} > ${TEST_SRC}; \
	  zcat $@.shuffled.gz | cut -f2 | head -$$((${DEVSIZE} + ${TESTSIZE})) | tail -${TESTSIZE} > ${TEST_TRG}; \
	  zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVSIZE} + ${TESTSIZE} + 1)) | gzip -c > ${DEV_SRC}.notused.gz; \
	  zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVSIZE} + ${TESTSIZE} + 1)) | gzip -c > ${DEV_TRG}.notused.gz; \
	fi
else
	echo "devset = top ${DEVSIZE} lines of ${notdir $@}.shuffled!" >> ${dir $@}/README
	zcat $@.shuffled.gz | cut -f1 | head -${DEVSIZE} > ${DEV_SRC}
	zcat $@.shuffled.gz | cut -f2 | head -${DEVSIZE} > ${DEV_TRG}
	zcat $@.shuffled.gz | cut -f1 | tail -n +$$((${DEVSIZE} + 1)) | gzip -c > ${DEV_SRC}.notused.gz
	zcat $@.shuffled.gz | cut -f2 | tail -n +$$((${DEVSIZE} + 1)) | gzip -c > ${DEV_TRG}.notused.gz
endif

# 	zcat $@.shuffled.gz | cut -f1 | tail -${TESTSIZE} > ${TEST_SRC}; \
# 	zcat $@.shuffled.gz | cut -f2 | tail -${TESTSIZE} > ${TEST_TRG}; \

${DEV_TRG}: ${DEV_SRC}
	@echo "done!"


### OLD: extract data from training data as dev/test set if the devdata is too small
### ---> this is confusing - skip this
###
### otherwise copy this directly after the target for ${DEV_SRC} above!
### and add a dependency on train-data for ${DEV_SRC}.shuffled.gz like this:
###   ${DEV_SRC}.shuffled.gz: ${TRAIN_SRC}.${PRE_SRC}.gz ${TRAIN_TRG}.${PRE_TRG}.gz
### and remove the dependency on dev-data for ${LOCAL_TRAIN_SRC}, i.e. change
###   ${LOCAL_TRAIN_SRC}: ${DEV_SRC} ${DEV_TRG} to
###   ${LOCAL_TRAIN_SRC}:
#
#	if (( `zcat $@.shuffled.gz | wc -l` < $$((${DEVMINSIZE} + ${DEVMINSIZE})) )); then \
#	  echo "Need more devdata - take some from traindata!"; \
#	  echo ".......... (1) extract top $$((${DEVSIZE} + ${TESTSIZE})) lines"; \
#	  echo "Too little dev/test data in ${DEVSET}!" >> ${dir $@}/README; \
#	  echo "Add top $$((${DEVSIZE} + ${TESTSIZE})) lines from ${DATASET} to dev/test" >> ${dir $@}/README; \
#	  echo "and remove those lines from training data" >> ${dir $@}/README; \
#	  zcat ${TRAIN_SRC}.${PRE_SRC}.gz | \
#	  head -$$((${DEVSIZE} + ${TESTSIZE})) | \
#	  sed 's/\@\@ //g' > $@.extra.${SRC}; \
#	  zcat ${TRAIN_TRG}.${PRE_TRG}.gz | \
#	  head -$$((${DEVSIZE} + ${TESTSIZE})) | \
#	  sed 's/\@\@ //g' > $@.extra.${TRG}; \
#	  echo ".......... (2) remaining lines for training"; \
#	  zcat ${TRAIN_SRC}.${PRE_SRC}.gz | \
#	  tail -n +$$((${DEVSIZE} + ${TESTSIZE} + 1)) | \
#	  sed 's/\@\@ //g' | gzip -c > $@.remaining.${SRC}.gz; \
#	  zcat ${TRAIN_TRG}.${PRE_TRG}.gz | \
#	  tail -n +$$((${DEVSIZE} + ${TESTSIZE} + 1)) | \
#	  sed 's/\@\@ //g' | gzip -c > $@.remaining.${TRG}.gz; \
#	  mv -f $@.remaining.${SRC}.gz ${TRAIN_SRC}.${PRE_SRC}.gz; \
#	  mv -f $@.remaining.${TRG}.gz ${TRAIN_TRG}.${PRE_TRG}.gz; \
#	  echo ".......... (3) append to devdata"; \
#	  mv $@.shuffled.gz $@.oldshuffled.gz; \
#	  paste $@.extra.${SRC} $@.extra.${TRG} > $@.shuffled; \
#	  zcat $@.oldshuffled.gz >> $@.shuffled; \
#	  rm $@.oldshuffled.gz; \
#	  gzip -f $@.shuffled; \
#	  rm -f $@.extra.${SRC} $@.extra.${TRG}; \
#	fi


add-to-dev-data: ${CLEAN_DEV_SRC} ${CLEAN_DEV_TRG}
ifneq (${CLEAN_DEV_SRC},)
ifneq (${words ${TRGLANGS}},1)
	echo "more than one target language";
	zcat ${CLEAN_DEV_SRC} |\
	sed "s/^/>>${TRG}<< /" >> ${DEV_SRC}
else
	echo "only one target language"
	zcat ${CLEAN_DEV_SRC} >> ${DEV_SRC}
endif
	zcat ${CLEAN_DEV_TRG} >> ${DEV_TRG}
endif
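## The `sed "s/^/>>${TRG}<< /"` calls above prepend a target-language token
## to every source line of a multi-target model. A standalone illustration
## (the language code `fi` is just an example value):

```shell
# Prepend the target-language token that multi-target models need on the
# source side; the model reads it to pick the output language.
echo 'a test sentence' | sed 's/^/>>fi<< /' > labelled.txt
cat labelled.txt
```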


####################
# test data
####################
##
## if devset and testset are from the same source:
## --> use part of the shuffled devset
## otherwise: create the testset
## exception: TESTSET exists in TESTSET_DIR
## --> just use that one
${TEST_SRC}: ${DEV_SRC}
ifneq (${TESTSET},${DEVSET})
	mkdir -p ${dir $@}
	rm -f ${TEST_SRC} ${TEST_TRG}
	if [ -e ${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz ]; then \
	  ${MAKE} CLEAN_TEST_SRC=${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz \
	          CLEAN_TEST_TRG=${TESTSET_DIR}/${TESTSET}.${TRG}.${PRE}.gz \
	  add-to-test-data; \
	else \
	  for s in ${SRCLANGS}; do \
	    for t in ${TRGLANGS}; do \
	      ${MAKE} SRC=$$s TRG=$$t add-to-test-data; \
	    done \
	  done; \
	  if [ ${TESTSIZE} -lt `cat $@ | wc -l` ]; then \
	    paste ${TEST_SRC} ${TEST_TRG} | shuf | gzip -c > $@.shuffled.gz; \
	    zcat $@.shuffled.gz | cut -f1 | tail -${TESTSIZE} > ${TEST_SRC}; \
	    zcat $@.shuffled.gz | cut -f2 | tail -${TESTSIZE} > ${TEST_TRG}; \
	    echo "testset = last ${TESTSIZE} lines of $@.shuffled!" >> ${dir $@}/README; \
	  fi \
	fi
else
	mkdir -p ${dir $@}
	if [ -e ${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz ]; then \
	  ${MAKE} CLEAN_TEST_SRC=${TESTSET_DIR}/${TESTSET}.${SRC}.${PRE}.gz \
	          CLEAN_TEST_TRG=${TESTSET_DIR}/${TESTSET}.${TRG}.${PRE}.gz \
	  add-to-test-data; \
	elif (( `zcat $<.shuffled.gz | wc -l` < $$((${DEVSIZE} + ${TESTSIZE})) )); then \
	  zcat $<.shuffled.gz | cut -f1 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_SRC}; \
	  zcat $<.shuffled.gz | cut -f2 | tail -n +$$((${DEVMINSIZE} + 1)) > ${TEST_TRG}; \
	else \
	  zcat $<.shuffled.gz | cut -f1 | tail -${TESTSIZE} > ${TEST_SRC}; \
	  zcat $<.shuffled.gz | cut -f2 | tail -${TESTSIZE} > ${TEST_TRG}; \
	fi
endif

${TEST_TRG}: ${TEST_SRC}
	@echo "done!"

add-to-test-data: ${CLEAN_TEST_SRC}
ifneq (${CLEAN_TEST_SRC},)
ifneq (${words ${TRGLANGS}},1)
	echo "more than one target language";
	zcat ${CLEAN_TEST_SRC} |\
	sed "s/^/>>${TRG}<< /" >> ${TEST_SRC}
else
	echo "only one target language"
	zcat ${CLEAN_TEST_SRC} >> ${TEST_SRC}
endif
	zcat ${CLEAN_TEST_TRG} >> ${TEST_TRG}
endif


## reduce training data size if necessary
ifdef TRAINSIZE
${TRAIN_SRC}.clean.${PRE_SRC}${TRAINSIZE}.gz: ${TRAIN_SRC}.clean.${PRE_SRC}.gz
	zcat $< | head -${TRAINSIZE} | gzip -c > $@

${TRAIN_TRG}.clean.${PRE_TRG}${TRAINSIZE}.gz: ${TRAIN_TRG}.clean.${PRE_TRG}.gz
	zcat $< | head -${TRAINSIZE} | gzip -c > $@
endif


# %.clean.gz: %.gz
#	mkdir -p ${TMPDIR}/${LANGSTR}/cleanup
#	gzip -cd < $< > ${TMPDIR}/${LANGSTR}/cleanup/$(notdir $@).${SRCEXT}


########################
# tune data
# TODO: do we use this?
########################

${TUNE_SRC}: ${TRAIN_SRC}
	mkdir -p ${dir $@}
	rm -f ${TUNE_SRC} ${TUNE_TRG}
	-for s in ${SRCLANGS}; do \
	  for t in ${TRGLANGS}; do \
	    ${MAKE} SRC=$$s TRG=$$t add-to-tune-data; \
	  done \
	done

${TUNE_TRG}: ${TUNE_SRC}
	@echo "done!"

add-to-tune-data: ${CLEAN_TUNE_SRC}
ifneq (${CLEAN_TUNE_SRC},)
ifneq (${words ${TRGLANGS}},1)
	echo "more than one target language";
	zcat ${CLEAN_TUNE_SRC} |\
	sed "s/^/>>${TRG}<< /" >> ${TUNE_SRC}
else
	echo "only one target language"
	zcat ${CLEAN_TUNE_SRC} >> ${TUNE_SRC}
endif
	zcat ${CLEAN_TUNE_TRG} >> ${TUNE_TRG}
endif



##----------------------------------------------
## tokenization
##----------------------------------------------

## normalisation for Chinese
%.zh_tw.tok: %.zh_tw.raw
	$(LOAD_MOSES) cat $< |\
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	$(TOKENIZER)/remove-non-printing-char.perl |\
	$(TOKENIZER)/normalize-punctuation.perl |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' > $@

%.zh_cn.tok: %.zh_cn.raw
	$(LOAD_MOSES) cat $< |\
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	$(TOKENIZER)/remove-non-printing-char.perl |\
	$(TOKENIZER)/normalize-punctuation.perl |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' > $@

%.zh.tok: %.zh.raw
	$(LOAD_MOSES) cat $< |\
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	$(TOKENIZER)/remove-non-printing-char.perl |\
	$(TOKENIZER)/normalize-punctuation.perl |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' > $@

## generic target for tokenization
%.tok: %.raw
	$(LOAD_MOSES) cat $< |\
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	$(TOKENIZER)/remove-non-printing-char.perl |\
	$(TOKENIZER)/normalize-punctuation.perl \
	-l ${lastword ${subst 1,,${subst 2,,${subst ., ,$(<:.raw=)}}}} |\
	$(TOKENIZER)/tokenizer.perl -a -threads $(THREADS) \
	-l ${lastword ${subst 1,,${subst 2,,${subst ., ,$(<:.raw=)}}}} |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' > $@


## only normalisation
%.norm: %.raw
	$(LOAD_MOSES) cat $< |\
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	$(TOKENIZER)/remove-non-printing-char.perl |\
	$(TOKENIZER)/normalize-punctuation.perl |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' > $@

%.norm.gz: %.gz
	$(LOAD_MOSES) zcat $< |\
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	$(TOKENIZER)/remove-non-printing-char.perl |\
	$(TOKENIZER)/normalize-punctuation.perl |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' | gzip -c > $@


## increase max number of tokens to 250
## (TODO: should MIN_NR_TOKENS be 1?)
MIN_NR_TOKENS = 0
MAX_NR_TOKENS = 250

## apply the cleanup script from Moses
%.src.clean.${PRE_SRC}: %.src.${PRE_SRC} %.trg.${PRE_TRG}
	rm -f $<.${SRCEXT} $<.${TRGEXT}
	ln -s ${word 1,$^} $<.${SRCEXT}
	ln -s ${word 2,$^} $<.${TRGEXT}
	$(MOSESSCRIPTS)/training/clean-corpus-n.perl $< $(SRCEXT) $(TRGEXT) $@ ${MIN_NR_TOKENS} ${MAX_NR_TOKENS}
	rm -f $<.${SRCEXT} $<.${TRGEXT}
	mv $@.${SRCEXT} $@
	mv $@.${TRGEXT} $(@:.src.clean.${PRE_SRC}=.trg.clean.${PRE_TRG})

%.trg.clean.${PRE_TRG}: %.src.clean.${PRE_SRC}
	@echo "done!"


# tokenize testsets
testsets/%.raw: testsets/%.gz
	gzip -cd < $< > $@

testsets/%.${PRE}.gz: testsets/%.${PRE}
	gzip -c < $< > $@

ALLTEST = $(patsubst %.gz,%.${PRE}.gz,${sort $(subst .${PRE},,${wildcard testsets/*/*.??.gz})})

tokenize-testsets prepare-testsets: ${ALLTEST}

##----------------------------------------------
## BPE
##----------------------------------------------

## source/target specific BPE
## - make sure to leave the language flags alone!
## - make sure that we do not delete the BPE code files
## if the BPE models already exist
## ---> do not create new ones and always keep the old ones
## ---> need to delete the old ones if we want to create new BPE models

BPESRCMODEL = ${TRAIN_SRC}.bpe${SRCBPESIZE:000=}k-model
BPETRGMODEL = ${TRAIN_TRG}.bpe${TRGBPESIZE:000=}k-model

.PRECIOUS: ${BPESRCMODEL} ${BPETRGMODEL}
.INTERMEDIATE: ${LOCAL_TRAIN_SRC} ${LOCAL_TRAIN_TRG}

${BPESRCMODEL}: ${WORKDIR}/%.bpe${SRCBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${BPESRCMODEL}),)
	mkdir -p ${dir $@}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
	python3 ${SNMTPATH}/learn_bpe.py -s $(SRCBPESIZE) < $< > $@
else
	cut -f2- -d ' ' $< > $<.text
	python3 ${SNMTPATH}/learn_bpe.py -s $(SRCBPESIZE) < $<.text > $@
	rm -f $<.text
endif
else
	@echo "$@ already exists!"
	@echo "WARNING! No new BPE model is created even though the data has changed!"
	@echo "WARNING! Delete the file if you want to start from scratch!"
	touch $@
endif

## no labels on the target language side
${BPETRGMODEL}: ${WORKDIR}/%.bpe${TRGBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${BPETRGMODEL}),)
	mkdir -p ${dir $@}
	python3 ${SNMTPATH}/learn_bpe.py -s $(TRGBPESIZE) < $< > $@
else
	@echo "$@ already exists!"
	@echo "WARNING! No new BPE codes are created!"
	@echo "WARNING! Delete the file if you want to start from scratch!"
	touch $@
endif


%.src.bpe${SRCBPESIZE:000=}k: %.src ${BPESRCMODEL}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
	python3 ${SNMTPATH}/apply_bpe.py -c $(word 2,$^) < $< > $@
else
	cut -f1 -d ' ' $< > $<.labels
	cut -f2- -d ' ' $< > $<.text
	python3 ${SNMTPATH}/apply_bpe.py -c $(word 2,$^) < $<.text > $@.text
	paste -d ' ' $<.labels $@.text > $@
	rm -f $<.labels $<.text $@.text
endif

%.trg.bpe${TRGBPESIZE:000=}k: %.trg ${BPETRGMODEL}
	python3 ${SNMTPATH}/apply_bpe.py -c $(word 2,$^) < $< > $@


## this places @@ markers in front of punctuation
## if it appears to the right of the segment boundary
## (useful if we use BPE without tokenization)
%.segfix: %
	perl -pe 's/(\P{P})\@\@ (\p{P})/$$1 \@\@$$2/g' < $< > $@



%.trg.txt: %.trg
	mkdir -p ${dir $@}
	mv $< $@

%.src.txt: %.src
	mkdir -p ${dir $@}
	mv $< $@


##----------------------------------------------
## sentence piece
##----------------------------------------------

SPMSRCMODEL = ${TRAIN_SRC}.spm${SRCBPESIZE:000=}k-model
SPMTRGMODEL = ${TRAIN_TRG}.spm${TRGBPESIZE:000=}k-model

.PRECIOUS: ${SPMSRCMODEL} ${SPMTRGMODEL}


${SPMSRCMODEL}: ${WORKDIR}/%.spm${SRCBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${SPMSRCMODEL}),)
	mkdir -p ${dir $@}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
	grep . $< > $<.text
else
	cut -f2- -d ' ' $< | grep . > $<.text
endif
	${SPM_HOME}/spm_train \
		--model_prefix=$@ --vocab_size=$(SRCBPESIZE) --input=$<.text \
		--character_coverage=1.0 --hard_vocab_limit=false
	mv $@.model $@
	rm -f $<.text
else
	@echo "$@ already exists!"
	@echo "WARNING! No new SPM model is created even though the data has changed!"
	@echo "WARNING! Delete the file if you want to start from scratch!"
	touch $@
endif

## no labels on the target language side
${SPMTRGMODEL}: ${WORKDIR}/%.spm${TRGBPESIZE:000=}k-model: ${TMPDIR}/${LANGSTR}/%
ifeq ($(wildcard ${SPMTRGMODEL}),)
	mkdir -p ${dir $@}
	grep . $< > $<.text
	${SPM_HOME}/spm_train \
		--model_prefix=$@ --vocab_size=$(TRGBPESIZE) --input=$<.text \
		--character_coverage=1.0 --hard_vocab_limit=false
	mv $@.model $@
	rm -f $<.text
else
	@echo "$@ already exists!"
	@echo "WARNING! No new SPM model created!"
	@echo "WARNING! Delete the file if you want to start from scratch!"
	touch $@
endif


%.src.spm${SRCBPESIZE:000=}k: %.src ${SPMSRCMODEL}
ifeq ($(TRGLANGS),${firstword ${TRGLANGS}})
	${SPM_HOME}/spm_encode --model $(word 2,$^) < $< > $@
else
	cut -f1 -d ' ' $< > $<.labels
	cut -f2- -d ' ' $< > $<.text
	${SPM_HOME}/spm_encode --model $(word 2,$^) < $<.text > $@.text
	paste -d ' ' $<.labels $@.text > $@
	rm -f $<.labels $<.text $@.text
endif

%.trg.spm${TRGBPESIZE:000=}k: %.trg ${SPMTRGMODEL}
	${SPM_HOME}/spm_encode --model $(word 2,$^) < $< > $@
## document-level models (with guided alignment)
%.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz:
	${MAKE} PRE_SRC=spm${SRCBPESIZE:000=}k PRE_TRG=spm${TRGBPESIZE:000=}k wordalign
	./large-context.pl -l ${CONTEXT_SIZE} \
	${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,%.src.spm${SRCBPESIZE:000=}k.gz,$@} \
	${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,%.trg.spm${TRGBPESIZE:000=}k.gz,$@} \
	${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,%.spm${SRCBPESIZE:000=}k-spm${TRGBPESIZE:000=}k.src-trg.alg.gz,$@} \
	| gzip > $@.tmp.gz
	zcat $@.tmp.gz | cut -f1 | gzip -c > $@
	zcat $@.tmp.gz | cut -f2 | gzip -c > ${subst .src.,.trg.,$@}
	zcat $@.tmp.gz | cut -f3 | \
	gzip -c > ${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz,%.spm${SRCBPESIZE:000=}k.doc${CONTEXT_SIZE}-spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.src-trg.alg.gz,$@}
	rm -f $@.tmp.gz

%.trg.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz: %.src.spm${SRCBPESIZE:000=}k.doc${CONTEXT_SIZE}.gz
	@echo "done!"


## for validation and test data:
%.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}:
	${MAKE} PRE_SRC=spm${SRCBPESIZE:000=}k PRE_TRG=spm${TRGBPESIZE:000=}k devdata
	${MAKE} PRE_SRC=spm${SRCBPESIZE:000=}k PRE_TRG=spm${TRGBPESIZE:000=}k testdata
	./large-context.pl -l ${CONTEXT_SIZE} \
	${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE},%.src.spm${SRCBPESIZE:000=}k,$@} \
	${patsubst %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE},%.trg.spm${TRGBPESIZE:000=}k,$@} \
	| gzip > $@.tmp.gz
	zcat $@.tmp.gz | cut -f1 > $@
	zcat $@.tmp.gz | cut -f2 > ${subst .src.,.trg.,$@}
	rm -f $@.tmp.gz

%.trg.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}: %.src.spm${TRGBPESIZE:000=}k.doc${CONTEXT_SIZE}
	@echo "done!"


##----------------------------------------------
## get data from local space and compress ...

${WORKDIR}/%.clean.${PRE_SRC}.gz: ${TMPDIR}/${LANGSTR}/%.clean.${PRE_SRC}
	mkdir -p ${dir $@}
	gzip -c < $< > $@

ifneq (${PRE_SRC},${PRE_TRG})
${WORKDIR}/%.clean.${PRE_TRG}.gz: ${TMPDIR}/${LANGSTR}/%.clean.${PRE_TRG}
	mkdir -p ${dir $@}
	gzip -c < $< > $@
endif
115	Makefile.def	Normal file
@@ -0,0 +1,115 @@
# -*-makefile-*-

# enable e-mail notification by setting EMAIL

WHOAMI = $(shell whoami)
ifeq ("$(WHOAMI)","tiedeman")
  EMAIL = jorg.tiedemann@helsinki.fi
endif

# job-specific settings (overwrite if necessary)
# HPC_EXTRA: additional SBATCH commands

CPU_MODULES = gcc/6.2.0 mkl
GPU_MODULES = cuda-env/8 mkl
# GPU_MODULES = python-env/3.5.3-ml cuda-env/8 mkl


# GPU = k80
GPU = p100
NR_GPUS = 1
HPC_MEM = 8g
HPC_NODES = 1
HPC_CORES = 1
HPC_DISK = 500
HPC_QUEUE = serial
# HPC_MODULES = nlpl-opus python-env/3.4.1 efmaral moses
# HPC_MODULES = nlpl-opus moses cuda-env marian python-3.5.3-ml
HPC_MODULES = ${GPU_MODULES}
HPC_EXTRA =
WALLTIME = 72


DEVICE = cuda
LOADCPU = module load ${CPU_MODULES}
LOADGPU = module load ${GPU_MODULES}

MARIAN_WORKSPACE = 13000

ifeq (${shell hostname},dx6-ibs-p2)
  APPLHOME = /opt/tools
  WORKHOME = ${shell realpath ${PWD}/work}
  OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
  MOSESHOME = ${APPLHOME}/mosesdecoder
  MARIAN = ${APPLHOME}/marian/build
  LOADMODS = echo "nothing to load"
  MARIAN_WORKSPACE = 10000
else ifeq (${shell hostname},dx7-nkiel-4gpu)
  APPLHOME = /opt/tools
  WORKHOME = ${shell realpath ${PWD}/work}
  OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
  MOSESHOME = ${APPLHOME}/mosesdecoder
  MARIAN = ${APPLHOME}/marian/build
  LOADMODS = echo "nothing to load"
  MARIAN_WORKSPACE = 10000
else ifneq ($(wildcard /wrk/tiedeman/research/),)
  DATAHOME = /proj/OPUS/WMT19/data/${LANGPAIR}
  # APPLHOME = ${USERAPPL}/tools
  APPLHOME = /proj/memad/tools
  WORKHOME = /wrk/tiedeman/research/marian/${SRC}-${TRG}
  OPUSHOME = /proj/nlpl/data/OPUS
  MOSESHOME = /proj/nlpl/software/moses/4.0-65c75ff/moses
  # MARIAN = /proj/nlpl/software/marian/1.2.0
  # MARIAN = /appl/ling/marian
  MARIAN = ${HOME}/appl_taito/tools/marian/build-gpu
  MARIANCPU = ${HOME}/appl_taito/tools/marian/build-cpu
  LOADMODS = ${LOADGPU}
else
  # CSCPROJECT = project_2001194
  CSCPROJECT = project_2000309
  DATAHOME = ${HOME}/work/opentrans/data/${LANGPAIR}
  WORKHOME = ${shell realpath ${PWD}/work}
  APPLHOME = ${HOME}/projappl
  OPUSHOME = /scratch/project_2000661/nlpl/data/OPUS
  MOSESHOME = ${APPLHOME}/mosesdecoder
  EFLOMAL_HOME = ${APPLHOME}/eflomal/
  MARIAN = ${APPLHOME}/marian/build
  MARIANCPU = ${APPLHOME}/marian/build
  # GPU_MODULES = cuda intel-mkl
  GPU = v100
  GPU_MODULES = python-env
  CPU_MODULES = python-env
  LOADMODS = echo "nothing to load"
  HPC_QUEUE = small
  MARIAN_WORKSPACE = 30000
endif


ifdef LOCAL_SCRATCH
  TMPDIR = ${LOCAL_SCRATCH}
endif


WORDALIGN = ${EFLOMAL_HOME}align.py
ATOOLS = ${FASTALIGN_HOME}atools

MULTEVALHOME = ${APPLHOME}/multeval
MOSESSCRIPTS = ${MOSESHOME}/scripts
TOKENIZER = ${MOSESSCRIPTS}/tokenizer
SNMTPATH = ${APPLHOME}/subword-nmt/subword_nmt


# sorted languages and langpair used to match resources in OPUS
SORTLANGS = $(sort ${SRC} ${TRG})
LANGPAIR = ${firstword ${SORTLANGS}}-${lastword ${SORTLANGS}}

## for identical language pairs: add a numeric extension
ifeq (${SRC},$(TRG))
  SRCEXT = ${SRC}1
  TRGEXT = ${SRC}2
else
  SRCEXT = ${SRC}
  TRGEXT = ${TRG}
endif
351	Makefile.dist	Normal file
@@ -0,0 +1,351 @@
# -*-makefile-*-
#
# make distribution packages
# and upload them to cPouta ObjectStorage
#

MODELSHOME = ${WORKHOME}/models
DIST_PACKAGE = ${MODELSHOME}/${LANGSTR}/${DATASET}.zip


## minimum BLEU score for models to be accepted as distribution package
MIN_BLEU_SCORE = 20

.PHONY: dist
dist: ${DIST_PACKAGE}

.PHONY: scores
scores: ${WORKHOME}/eval/scores.txt


## get the best model from all kinds of alternative setups
## in the following sub-directories (add prefix work-)

# ALT_MODEL_DIR = bpe-old bpe-memad bpe spm-noalign bpe-align spm
ALT_MODEL_DIR = spm

best_dist_all:
	for l in $(sort ${shell ls work* | grep -- '-' | grep -v old | grep -v work}); do \
	  if [ `find work*/$$l -name '${DATASET}${TRAINSIZE}.*.npz' | wc -l` -gt 0 ]; then \
	    ${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
	            TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" best_dist; \
	  fi \
	done


## find the best model according to test set scores
## and make a distribution package from that model
## (BLEU needs to be above MIN_BLEU_SCORE)
## NEW: don't trust models tested with GNOME test sets!

best_dist:
	@m=0;\
	s=''; \
	echo "------------------------------------------------"; \
	echo "search best model for ${LANGSTR}"; \
	for d in ${ALT_MODEL_DIR}; do \
	  e=`ls work-$$d/${LANGSTR}/val/*.trg | xargs basename | sed 's/\.trg//'`; \
	  echo "evaldata = $$e"; \
	  if [ "$$e" != "GNOME" ]; then \
	    if ls work-$$d/${LANGSTR}/$$e*.eval 1> /dev/null 2>&1; then \
	      b=`grep 'BLEU+' work-$$d/${LANGSTR}/$$e*.eval | cut -f3 -d' '`; \
	      if (( $$(echo "$$m-$$b < 1" | bc -l) )); then \
	        echo "$$d ($$b) is better than or not much worse than $$s ($$m)!"; \
	        m=$$b; \
	        s=$$d; \
	      else \
	        echo "$$d ($$b) is worse than $$s ($$m)!"; \
	      fi \
	    fi \
	  fi \
	done; \
	echo "------------------------------------------------"; \
	if [ "$$s" != "" ]; then \
	  if (( $$(echo "$$m > ${MIN_BLEU_SCORE}" | bc -l) )); then \
	    ${MAKE} MODELSHOME=${PWD}/models \
	            MODELS_URL=https://object.pouta.csc.fi/OPUS-MT-models dist-$$s; \
	  fi; \
	fi



## make a package for distribution

## old: only accept models with a certain evaluation score:
# if [ `grep BLEU $(TEST_EVALUATION) | cut -f3 -d ' ' | cut -f1 -d '.'` -ge ${MIN_BLEU_SCORE} ]; then \

DATE = ${shell date +%F}
MODELS_URL = https://object.pouta.csc.fi/OPUS-MT-dev
SKIP_DIST_EVAL = 0

${DIST_PACKAGE}: ${MODEL_FINAL}
ifneq (${SKIP_DIST_EVAL},1)
	@${MAKE} $(TEST_EVALUATION)
	@${MAKE} $(TEST_COMPARISON)
endif
	@mkdir -p ${dir $@}
	@touch ${WORKDIR}/source.tcmodel
	@echo "# $(notdir ${@:.zip=})-${DATE}.zip" > ${WORKDIR}/README.md
	@echo '' >> ${WORKDIR}/README.md
	@echo "* dataset: ${DATASET}" >> ${WORKDIR}/README.md
	@echo "* model: ${MODELTYPE}" >> ${WORKDIR}/README.md
	@if [ -e ${BPESRCMODEL} ]; then \
	  echo "* pre-processing: normalization + tokenization + BPE" >> ${WORKDIR}/README.md; \
	  cp ${BPESRCMODEL} ${WORKDIR}/source.bpe; \
	  cp ${BPETRGMODEL} ${WORKDIR}/target.bpe; \
	  cp preprocess-bpe.sh ${WORKDIR}/preprocess.sh; \
	  cp postprocess-bpe.sh ${WORKDIR}/postprocess.sh; \
	elif [ -e ${SPMSRCMODEL} ]; then \
	  echo "* pre-processing: normalization + SentencePiece" >> ${WORKDIR}/README.md; \
	  cp ${SPMSRCMODEL} ${WORKDIR}/source.spm; \
	  cp ${SPMTRGMODEL} ${WORKDIR}/target.spm; \
	  cp preprocess-spm.sh ${WORKDIR}/preprocess.sh; \
	  cp postprocess-spm.sh ${WORKDIR}/postprocess.sh; \
	fi
	@if [ ${words ${TRGLANGS}} -gt 1 ]; then \
	  echo '* a sentence-initial language token is required in the form of `>>id<<` (id = valid target language ID)' \
	  >> ${WORKDIR}/README.md; \
	fi
	@echo "* download: [$(notdir ${@:.zip=})-${DATE}.zip](${MODELS_URL}/${LANGSTR}/$(notdir ${@:.zip=})-${DATE}.zip)" >> ${WORKDIR}/README.md
	if [ -e $(TEST_EVALUATION) ]; then \
	  echo "* test set translations: [$(notdir ${@:.zip=})-${DATE}.test.txt](${MODELS_URL}/${LANGSTR}/$(notdir ${@:.zip=})-${DATE}.test.txt)" >> ${WORKDIR}/README.md; \
	  echo "* test set scores: [$(notdir ${@:.zip=})-${DATE}.eval.txt](${MODELS_URL}/${LANGSTR}/$(notdir ${@:.zip=})-${DATE}.eval.txt)" >> ${WORKDIR}/README.md; \
	  echo '' >> ${WORKDIR}/README.md; \
	  echo '## Benchmarks' >> ${WORKDIR}/README.md; \
	  echo '' >> ${WORKDIR}/README.md; \
	  cd ${WORKDIR}; \
	  grep -H BLEU *k${NR}.*eval | \
	  tr '.' '/' | cut -f1,5,6 -d '/' | tr '/' "." > $@.1; \
	  grep BLEU *k${NR}.*eval | cut -f3 -d ' ' > $@.2; \
	  grep chrF *k${NR}.*eval | cut -f3 -d ' ' > $@.3; \
	  echo '| testset | BLEU | chr-F |' >> README.md; \
	  echo '|-----------------------|-------|-------|' >> README.md; \
	  paste $@.1 $@.2 $@.3 | sed "s/\t/ | /g;s/^/| /;s/$$/ |/" >> README.md; \
	  rm -f $@.1 $@.2 $@.3; \
	fi
	@cat ${WORKDIR}/README.md >> ${dir $@}README.md
	@echo '' >> ${dir $@}README.md
	@cp LICENSE ${WORKDIR}/
	@chmod +x ${WORKDIR}/preprocess.sh
	@sed -e 's# - /.*/\([^/]*\)$$# - \1#' \
	     -e 's/beam-size: [0-9]*$$/beam-size: 6/' \
	     -e 's/mini-batch: [0-9]*$$/mini-batch: 1/' \
	     -e 's/maxi-batch: [0-9]*$$/maxi-batch: 1/' \
	     -e 's/relative-paths: false/relative-paths: true/' \
	     < ${MODEL_DECODER} > ${WORKDIR}/decoder.yml
	@cd ${WORKDIR} && zip ${notdir $@} \
		README.md LICENSE \
		${notdir ${MODEL_FINAL}} \
		${notdir ${MODEL_VOCAB}} \
		${notdir ${MODEL_VALIDLOG}} \
		${notdir ${MODEL_TRAINLOG}} \
		source.* target.* decoder.yml preprocess.sh postprocess.sh
	@mkdir -p ${dir $@}
	@mv -f ${WORKDIR}/${notdir $@} ${@:.zip=}-${DATE}.zip
	if [ -e $(TEST_EVALUATION) ]; then \
	  cp $(TEST_EVALUATION) ${@:.zip=}-${DATE}.eval.txt; \
	  cp $(TEST_COMPARISON) ${@:.zip=}-${DATE}.test.txt; \
	fi
	@rm -f $@
	@cd ${dir $@} && ln -s $(notdir ${@:.zip=})-${DATE}.zip ${notdir $@}
	@rm -f ${WORKDIR}/decoder.yml ${WORKDIR}/source.* ${WORKDIR}/target.*
	@rm -f ${WORKDIR}/preprocess.sh ${WORKDIR}/postprocess.sh

EVALSCORES = ${patsubst ${WORKHOME}/%.eval,${WORKHOME}/eval/%.eval.txt,${wildcard ${WORKHOME}/*/*.eval}}
|
||||
EVALTRANSL = ${patsubst ${WORKHOME}/%.compare,${WORKHOME}/eval/%.test.txt,${wildcard ${WORKHOME}/*/*.compare}}
|
||||
|
||||
|
||||
## upload to Object Storage
## Don't forget to run this before uploading!
#
#   source project_2000661-openrc.sh
#
# - make upload ......... released models = all sub-dirs in models/
# - make upload-models .. trained models in current WORKHOME to OPUS-MT-dev
# - make upload-scores .. score file with benchmark results to OPUS-MT-eval
# - make upload-eval .... benchmark tests from models in WORKHOME
# - make upload-images .. images of VMs that run OPUS-MT

upload:
	cd models && swift upload OPUS-MT-models --changed --skip-identical *
	swift post OPUS-MT-models --read-acl ".r:*"
	swift list OPUS-MT-models > index.txt
	swift upload OPUS-MT-models index.txt
	rm -f index.txt

upload-models:
	cd ${WORKHOME} && swift upload OPUS-MT-dev --changed --skip-identical models
	swift post OPUS-MT-dev --read-acl ".r:*"
	swift list OPUS-MT-dev > index.txt
	swift upload OPUS-MT-dev index.txt
	rm -f index.txt

upload-scores: ${WORKHOME}/eval/scores.txt
	cd ${WORKHOME} && swift upload OPUS-MT-eval --changed --skip-identical eval/scores.txt
	swift post OPUS-MT-eval --read-acl ".r:*"

upload-eval: ${EVALSCORES} ${EVALTRANSL}
	cd ${WORKHOME} && swift upload OPUS-MT-eval --changed --skip-identical eval
	swift post OPUS-MT-eval --read-acl ".r:*"

upload-images:
	cd ${WORKHOME} && swift upload OPUS-MT --changed --skip-identical \
		--use-slo --segment-size 5G opusMT-images
	swift post OPUS-MT-images --read-acl ".r:*"

## this is for the multeval scores
# ${WORKHOME}/eval/scores.txt: ${EVALSCORES}
#	cd ${WORKHOME} && \
#	grep base */*eval | cut -f1,2- -d '/' | cut -f1,6- -d '.' | \
#	sed 's/-/ /' | sed 's/\// /' | sed 's/ ([^)]*)//g' |\
#	sed 's/.eval:baseline//' | sed "s/ */\t/g" | sort > $@

${WORKHOME}/eval/scores.txt: ${EVALSCORES}
	cd ${WORKHOME} && grep BLEU */*k${NR}.*eval | cut -f1 -d '/' | tr '-' "\t" > $@.1
	cd ${WORKHOME} && grep BLEU */*k${NR}.*eval | tr '.' '/' | cut -f2,6,7 -d '/' | tr '/' "." > $@.2
	cd ${WORKHOME} && grep BLEU */*k${NR}.*eval | cut -f3 -d ' ' > $@.3
	cd ${WORKHOME} && grep chrF */*k${NR}.*eval | cut -f3 -d ' ' > $@.4
	paste $@.1 $@.2 $@.3 $@.4 > $@
	rm -f $@.1 $@.2 $@.3 $@.4
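## The scores.txt recipe above scrapes language pair and scores out of
## sacrebleu-style eval files with grep/cut/tr and joins them with paste.
## A reduced sketch on fabricated eval files (directory layout and score
## lines are made up; the recipe also extracts a testset column via $@.2):

```shell
# <langpair>/<testset>...eval files with "BLEU ... = <score>" / "chrF2 ... = <score>" lines
mkdir -p de-en fi-en
printf 'BLEU = 25.3 58.1/31.2/19.4/12.6\nchrF2 = 0.54\n' > de-en/test.eval
printf 'BLEU = 18.7 49.0/24.1/13.8/8.2\nchrF2 = 0.47\n'  > fi-en/test.eval

# grep prefixes each match with "dir/file:", so cut -f1 -d'/' is the pair
grep BLEU */*.eval | cut -f1 -d '/' | tr '-' "\t" > scores.1   # src/trg columns
grep BLEU */*.eval | cut -f3 -d ' ' > scores.2                 # BLEU score (3rd field)
grep chrF */*.eval | cut -f3 -d ' ' > scores.3                 # chrF score
paste scores.1 scores.2 scores.3 > scores.txt
cat scores.txt
rm -rf de-en fi-en scores.1 scores.2 scores.3
```

Each row comes out tab-separated, e.g. `de<TAB>en<TAB>25.3<TAB>0.54`.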

${EVALSCORES}: # ${WORKHOME}/eval/%.eval.txt: ${WORKHOME}/models/%.eval
	mkdir -p ${dir $@}
	cp ${patsubst ${WORKHOME}/eval/%.eval.txt,${WORKHOME}/%.eval,$@} $@
#	cp $< $@

${EVALTRANSL}: # ${WORKHOME}/eval/%.test.txt: ${WORKHOME}/models/%.compare
	mkdir -p ${dir $@}
	cp ${patsubst ${WORKHOME}/eval/%.test.txt,${WORKHOME}/%.compare,$@} $@
#	cp $< $@


# ## dangerous area ....
# delete-eval:
#	swift delete OPUS-MT eval

######################################################################
## handle old models in previous work directories
## obsolete now?
######################################################################


##-----------------------------------
## make packages from trained models
## check old-models as well!

TRAINED_NEW_MODELS = ${patsubst ${WORKHOME}/%/,%,${dir ${wildcard ${WORKHOME}/*/*.best-perplexity.npz}}}
# TRAINED_OLD_MODELS = ${patsubst ${WORKHOME}/old-models/%/,%,${dir ${wildcard ${WORKHOME}/old-models/*/*.best-perplexity.npz}}}
TRAINED_OLD_MODELS = ${patsubst ${WORKHOME}/old-models/%/,%,${dir ${wildcard ${WORKHOME}/old-models/??-??/*.best-perplexity.npz}}}

TRAINED_OLD_ONLY_MODELS = ${filter-out ${TRAINED_NEW_MODELS},${TRAINED_OLD_MODELS}}
TRAINED_NEW_ONLY_MODELS = ${filter-out ${TRAINED_OLD_MODELS},${TRAINED_NEW_MODELS}}
TRAINED_DOUBLE_MODELS = ${filter ${TRAINED_NEW_MODELS},${TRAINED_OLD_MODELS}}

## make packages of all new models
## unless there are better models in old-models
new-models-dist:
	@echo "nr of extra models: ${words ${TRAINED_NEW_ONLY_MODELS}}"
	for l in ${TRAINED_NEW_ONLY_MODELS}; do \
	  ${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
	          TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" dist; \
	done
	@echo "trained double ${words ${TRAINED_DOUBLE_MODELS}}"
	for l in ${TRAINED_DOUBLE_MODELS}; do \
	  n=`grep 'new best' work/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
	  o=`grep 'new best' work/old-models/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
	  if (( $$(echo "$$n < $$o" |bc -l) )); then \
	    ${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
	            TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" dist; \
	  fi; \
	done


## fix decoder path in old-models (to run evaluations)
fix-decoder-path:
	for l in ${wildcard ${WORKHOME}/old-models/*/*.best-perplexity.npz.decoder.yml}; do \
	  sed --in-place=.backup 's#/\(..-..\)/opus#/old-models/\1/opus#' $$l; \
	  sed --in-place=.backup2 's#/old-models/old-models/#/old-models/#' $$l; \
	  sed --in-place=.backup2 's#/old-models/old-models/#/old-models/#' $$l; \
	done

## make packages of all old models from old-models
## unless there are better models in work (new models)
old-models-dist:
	@echo "nr of extra models: ${words ${TRAINED_OLD_ONLY_MODELS}}"
	for l in ${TRAINED_OLD_ONLY_MODELS}; do \
	  ${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
	          TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" \
	          WORKHOME=${WORKHOME}/old-models \
	          MODELSHOME=${WORKHOME}/models dist; \
	done
	@echo "trained double ${words ${TRAINED_DOUBLE_MODELS}}"
	for l in ${TRAINED_DOUBLE_MODELS}; do \
	  n=`grep 'new best' work/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
	  o=`grep 'new best' work/old-models/$$l/*.valid1.log | tail -1 | cut -f12 -d ' '`; \
	  if (( $$(echo "$$o < $$n" |bc -l) )); then \
	    ${MAKE} SRCLANGS="`echo $$l | cut -f1 -d'-' | sed 's/\\+/ /g'`" \
	            TRGLANGS="`echo $$l | cut -f2 -d'-' | sed 's/\\+/ /g'`" \
	            WORKHOME=${WORKHOME}/old-models \
	            MODELSHOME=${WORKHOME}/models dist; \
	  else \
	    echo "$$l: new better than old"; \
	  fi; \
	done


## old models had slightly different naming conventions

LASTSRC = ${lastword ${SRCLANGS}}
LASTTRG = ${lastword ${TRGLANGS}}

MODEL_OLD = ${MODEL_SUBDIR}${DATASET}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}.${LASTSRC}${LASTTRG}
MODEL_OLD_BASENAME = ${MODEL_OLD}.${MODELTYPE}.model${NR}
MODEL_OLD_FINAL = ${WORKDIR}/${MODEL_OLD_BASENAME}.npz.best-perplexity.npz
MODEL_OLD_VOCAB = ${WORKDIR}/${MODEL_OLD}.vocab.${MODEL_VOCABTYPE}
MODEL_OLD_DECODER = ${MODEL_OLD_FINAL}.decoder.yml
MODEL_TRANSLATE = ${WORKDIR}/${TESTSET}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}
MODEL_OLD_TRANSLATE = ${WORKDIR}/${TESTSET}.${MODEL_OLD}${NR}.${MODELTYPE}.${SRC}.${TRG}
MODEL_OLD_VALIDLOG = ${MODEL_OLD}.${MODELTYPE}.valid${NR}.log
MODEL_OLD_TRAINLOG = ${MODEL_OLD}.${MODELTYPE}.train${NR}.log


link-old-models:
	if [ ! -e ${MODEL_FINAL} ]; then \
	  if [ -e ${MODEL_OLD_FINAL} ]; then \
	    ln -s ${MODEL_OLD_FINAL} ${MODEL_FINAL}; \
	    ln -s ${MODEL_OLD_VOCAB} ${MODEL_VOCAB}; \
	    ln -s ${MODEL_OLD_DECODER} ${MODEL_DECODER}; \
	  fi; \
	fi
	if [ ! -e ${MODEL_TRANSLATE} ]; then \
	  if [ -e ${MODEL_OLD_TRANSLATE} ]; then \
	    ln -s ${MODEL_OLD_TRANSLATE} ${MODEL_TRANSLATE}; \
	  fi; \
	fi
	if [ ! -e ${WORKDIR}/${MODEL_VALIDLOG} ]; then \
	  if [ -e ${WORKDIR}/${MODEL_OLD_VALIDLOG} ]; then \
	    ln -s ${WORKDIR}/${MODEL_OLD_VALIDLOG} ${WORKDIR}/${MODEL_VALIDLOG}; \
	    ln -s ${WORKDIR}/${MODEL_OLD_TRAINLOG} ${WORKDIR}/${MODEL_TRAINLOG}; \
	  fi; \
	fi
	rm -f ${MODEL_TRANSLATE}.eval
	rm -f ${MODEL_TRANSLATE}.compare

60
Makefile.doclevel
Normal file
@ -0,0 +1,60 @@
# -*-makefile-*-


DOCLEVEL_BENCHMARK_DATA = https://zenodo.org/record/3525366/files/doclevel-MT-benchmark-discomt2019.zip


## use the doclevel benchmark data sets
%-ost:
	${MAKE} ost-datasets
	${MAKE} SRCLANGS=en TRGLANGS=de \
		TRAINSET=ost-train \
		DEVSET=ost-dev \
		TESTSET=ost-test \
		DEVSIZE=100000 TESTSIZE=100000 HELDOUTSIZE=0 \
	${@:-ost=}


ost-datasets: ${DATADIR}/${PRE}/ost-train.de-en.clean.de.gz \
	${DATADIR}/${PRE}/ost-train.de-en.clean.en.gz \
	${DATADIR}/${PRE}/ost-dev.de-en.clean.de.gz \
	${DATADIR}/${PRE}/ost-dev.de-en.clean.en.gz \
	${DATADIR}/${PRE}/ost-test.de-en.clean.de.gz \
	${DATADIR}/${PRE}/ost-test.de-en.clean.en.gz


.INTERMEDIATE: ${WORKHOME}/doclevel-MT-benchmark

## download the doc-level data set
${WORKHOME}/doclevel-MT-benchmark:
	wget -O $@.zip ${DOCLEVEL_BENCHMARK_DATA}?download=1
	unzip -d ${dir $@} $@.zip
	rm -f $@.zip

${DATADIR}/${PRE}/ost-train.de-en.clean.de.gz: ${WORKHOME}/doclevel-MT-benchmark
	mkdir -p ${dir $@}
	$(TOKENIZER)/detokenizer.perl -l de < $</train/ost.tok.de | gzip -c > $@

${DATADIR}/${PRE}/ost-train.de-en.clean.en.gz: ${WORKHOME}/doclevel-MT-benchmark
	mkdir -p ${dir $@}
	$(TOKENIZER)/detokenizer.perl -l en < $</train/ost.tok.en | gzip -c > $@

${DATADIR}/${PRE}/ost-dev.de-en.clean.de.gz: ${WORKHOME}/doclevel-MT-benchmark
	mkdir -p ${dir $@}
	$(TOKENIZER)/detokenizer.perl -l de < $</dev/ost.tok.de | gzip -c > $@

${DATADIR}/${PRE}/ost-dev.de-en.clean.en.gz: ${WORKHOME}/doclevel-MT-benchmark
	mkdir -p ${dir $@}
	$(TOKENIZER)/detokenizer.perl -l en < $</dev/ost.tok.en | gzip -c > $@

${DATADIR}/${PRE}/ost-test.de-en.clean.de.gz: ${WORKHOME}/doclevel-MT-benchmark
	mkdir -p ${dir $@}
	$(TOKENIZER)/detokenizer.perl -l de < $</test/ost.tok.de | gzip -c > $@

${DATADIR}/${PRE}/ost-test.de-en.clean.en.gz: ${WORKHOME}/doclevel-MT-benchmark
	mkdir -p ${dir $@}
	$(TOKENIZER)/detokenizer.perl -l en < $</test/ost.tok.en | gzip -c > $@

126
Makefile.env
Normal file
@ -0,0 +1,126 @@
# -*-makefile-*-
#
# settings of the environment
# - essential tools and their paths
# - system-specific settings
#


## modules to be loaded in sbatch scripts

CPU_MODULES = gcc/6.2.0 mkl
GPU_MODULES = cuda-env/8 mkl
# GPU_MODULES = python-env/3.5.3-ml cuda-env/8 mkl


# job-specific settings (overwrite if necessary)
# HPC_EXTRA: additional SBATCH commands

NR_GPUS = 1
HPC_NODES = 1
HPC_DISK = 500
HPC_QUEUE = serial
# HPC_MODULES = nlpl-opus python-env/3.4.1 efmaral moses
# HPC_MODULES = nlpl-opus moses cuda-env marian python-3.5.3-ml
HPC_MODULES = ${GPU_MODULES}
HPC_EXTRA =

MEM = 4g
THREADS = 1
WALLTIME = 72


## set variables with HPC prefix

ifndef HPC_TIME
  HPC_TIME = ${WALLTIME}:00
endif

ifndef HPC_CORES
  HPC_CORES = ${THREADS}
endif

ifndef HPC_MEM
  HPC_MEM = ${MEM}
endif


# GPU = k80
GPU = p100
DEVICE = cuda
LOADCPU = module load ${CPU_MODULES}
LOADGPU = module load ${GPU_MODULES}

ifeq (${shell hostname},dx6-ibs-p2)
  APPLHOME = /opt/tools
  WORKHOME = ${shell realpath ${PWD}/work-spm}
  OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
  MOSESHOME = ${APPLHOME}/mosesdecoder
  MARIAN = ${APPLHOME}/marian/build
  LOADMODS = echo "nothing to load"
else ifeq (${shell hostname},dx7-nkiel-4gpu)
  APPLHOME = /opt/tools
  WORKHOME = ${shell realpath ${PWD}/work-spm}
  OPUSHOME = tiedeman@taito.csc.fi:/proj/nlpl/data/OPUS/
  MOSESHOME = ${APPLHOME}/mosesdecoder
  MARIAN = ${APPLHOME}/marian/build
  LOADMODS = echo "nothing to load"
else ifneq ($(wildcard /wrk/tiedeman/research/),)
  DATAHOME = /proj/OPUS/WMT19/data/${LANGPAIR}
  # APPLHOME = ${USERAPPL}/tools
  APPLHOME = /proj/memad/tools
  WORKHOME = /wrk/tiedeman/research/Opus-MT/work-spm
  OPUSHOME = /proj/nlpl/data/OPUS
  MOSESHOME = /proj/nlpl/software/moses/4.0-65c75ff/moses
  # MARIAN = /proj/nlpl/software/marian/1.2.0
  # MARIAN = /appl/ling/marian
  MARIAN = ${HOME}/appl_taito/tools/marian/build-gpu
  MARIANCPU = ${HOME}/appl_taito/tools/marian/build-cpu
  LOADMODS = ${LOADGPU}
else
  CSCPROJECT = project_2001194
  # CSCPROJECT = project_2000309
  DATAHOME = ${HOME}/work/opentrans/data/${LANGPAIR}
  WORKHOME = ${shell realpath ${PWD}/work-spm}
  APPLHOME = ${HOME}/projappl
  # OPUSHOME = /scratch/project_2000661/nlpl/data/OPUS
  OPUSHOME = /projappl/nlpl/data/OPUS
  MOSESHOME = ${APPLHOME}/mosesdecoder
  EFLOMAL_HOME = ${APPLHOME}/eflomal/
  # MARIAN = ${APPLHOME}/marian/build
  # MARIANCPU = ${APPLHOME}/marian/build
  MARIAN = ${APPLHOME}/marian-dev/build-spm
  MARIANCPU = ${APPLHOME}/marian-dev/build-cpu
  MARIANSPM = ${APPLHOME}/marian-dev/build-spm
  # GPU_MODULES = cuda intel-mkl
  GPU = v100
  GPU_MODULES = python-env
  # gcc/8.3.0 boost/1.68.0-mpi intel-mkl
  CPU_MODULES = python-env
  LOADMODS = echo "nothing to load"
  HPC_QUEUE = small
endif


ifdef LOCAL_SCRATCH
  TMPDIR = ${LOCAL_SCRATCH}
endif


## other tools and their locations

WORDALIGN = ${EFLOMAL_HOME}align.py
ATOOLS = ${FASTALIGN_HOME}atools

MULTEVALHOME = ${APPLHOME}/multeval
MOSESSCRIPTS = ${MOSESHOME}/scripts
TOKENIZER = ${MOSESSCRIPTS}/tokenizer
SNMTPATH = ${APPLHOME}/subword-nmt/subword_nmt


## SentencePiece
SPM_HOME = ${MARIANSPM}
99
Makefile.slurm
Normal file
@ -0,0 +1,99 @@
# -*-makefile-*-


# enable e-mail notification by setting EMAIL

WHOAMI = $(shell whoami)
ifeq ("$(WHOAMI)","tiedeman")
  EMAIL = jorg.tiedemann@helsinki.fi
endif


##---------------------------------------------
## submit jobs
##---------------------------------------------

## submit job to gpu queue

%.submit:
	mkdir -p ${WORKDIR}
	echo '#!/bin/bash -l' > $@
	echo '#SBATCH -J "${DATASET}-${@:.submit=}"' >> $@
	echo '#SBATCH -o ${DATASET}-${@:.submit=}.out.%j' >> $@
	echo '#SBATCH -e ${DATASET}-${@:.submit=}.err.%j' >> $@
	echo '#SBATCH --mem=${HPC_MEM}' >> $@
#	echo '#SBATCH --exclude=r18g05' >> $@
ifdef EMAIL
	echo '#SBATCH --mail-type=END' >> $@
	echo '#SBATCH --mail-user=${EMAIL}' >> $@
endif
	echo '#SBATCH -n 1' >> $@
	echo '#SBATCH -N 1' >> $@
	echo '#SBATCH -p gpu' >> $@
ifeq (${shell hostname --domain},bullx)
	echo '#SBATCH --account=${CSCPROJECT}' >> $@
	echo '#SBATCH --gres=gpu:${GPU}:${NR_GPUS},nvme:${HPC_DISK}' >> $@
else
	echo '#SBATCH --gres=gpu:${GPU}:${NR_GPUS}' >> $@
endif
	echo '#SBATCH -t ${HPC_TIME}:00' >> $@
	echo 'module use -a /proj/nlpl/modules' >> $@
	for m in ${GPU_MODULES}; do \
	  echo "module load $$m" >> $@; \
	done
	echo 'module list' >> $@
	echo 'cd $${SLURM_SUBMIT_DIR:-.}' >> $@
	echo 'pwd' >> $@
	echo 'echo "Starting at `date`"' >> $@
	echo 'srun ${MAKE} ${MAKEARGS} ${@:.submit=}' >> $@
	echo 'echo "Finishing at `date`"' >> $@
	sbatch $@
	mkdir -p ${WORKDIR}
	mv $@ ${WORKDIR}/$@
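## The %.submit recipe above assembles a SLURM batch script line by line
## with echo appends before handing it to sbatch. A minimal stand-alone
## sketch of that assembly (job name, memory, and walltime are made-up
## stand-ins for ${DATASET}-${@:.submit=}, ${HPC_MEM}, ${HPC_TIME}):

```shell
JOB=train.submit
echo '#!/bin/bash -l'            > $JOB   # first line truncates/creates the file
echo '#SBATCH -J "opus-train"'  >> $JOB   # every later line appends
echo '#SBATCH --mem=4g'         >> $JOB
echo '#SBATCH -t 72:00:00'      >> $JOB
echo 'echo "Starting at `date`"' >> $JOB  # backticks stay literal in single quotes
cat $JOB
```

In the recipe the file would now be passed to `sbatch` and moved into
${WORKDIR}; the single quotes matter so that the embedded backticks and
`$` expansions run inside the batch job, not at generation time.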

# 	echo 'srun ${MAKE} NR=${NR} MODELTYPE=${MODELTYPE} DATASET=${DATASET} SRC=${SRC} TRG=${TRG} PRE_SRC=${PRE_SRC} PRE_TRG=${PRE_TRG} ${MAKEARGS} ${@:.submit=}' >> $@


## submit job to cpu queue

%.submitcpu:
	mkdir -p ${WORKDIR}
	echo '#!/bin/bash -l' > $@
	echo '#SBATCH -J "${@:.submitcpu=}"' >> $@
	echo '#SBATCH -o ${@:.submitcpu=}.out.%j' >> $@
	echo '#SBATCH -e ${@:.submitcpu=}.err.%j' >> $@
	echo '#SBATCH --mem=${HPC_MEM}' >> $@
ifdef EMAIL
	echo '#SBATCH --mail-type=END' >> $@
	echo '#SBATCH --mail-user=${EMAIL}' >> $@
endif
ifeq (${shell hostname --domain},bullx)
	echo '#SBATCH --account=${CSCPROJECT}' >> $@
	echo '#SBATCH --gres=nvme:${HPC_DISK}' >> $@
#	echo '#SBATCH --exclude=r05c49' >> $@
#	echo '#SBATCH --exclude=r07c51' >> $@
#	echo '#SBATCH --exclude=r06c50' >> $@
endif
	echo '#SBATCH -n ${HPC_CORES}' >> $@
	echo '#SBATCH -N ${HPC_NODES}' >> $@
	echo '#SBATCH -p ${HPC_QUEUE}' >> $@
	echo '#SBATCH -t ${HPC_TIME}:00' >> $@
	echo '${HPC_EXTRA}' >> $@
	echo 'module use -a /proj/nlpl/modules' >> $@
	for m in ${CPU_MODULES}; do \
	  echo "module load $$m" >> $@; \
	done
	echo 'module list' >> $@
	echo 'cd $${SLURM_SUBMIT_DIR:-.}' >> $@
	echo 'pwd' >> $@
	echo 'echo "Starting at `date`"' >> $@
	echo '${MAKE} -j ${HPC_CORES} ${MAKEARGS} ${@:.submitcpu=}' >> $@
	echo 'echo "Finishing at `date`"' >> $@
	sbatch $@
	mkdir -p ${WORKDIR}
	mv $@ ${WORKDIR}/$@


# 	echo '${MAKE} -j ${HPC_CORES} DATASET=${DATASET} SRC=${SRC} TRG=${TRG} PRE_SRC=${PRE_SRC} PRE_TRG=${PRE_TRG} ${MAKEARGS} ${@:.submitcpu=}' >> $@
485
Makefile.tasks
Normal file
@ -0,0 +1,485 @@
# -*-makefile-*-
#
# pre-defined tasks that we might want to run
#


MEMAD_LANGS = de en fi fr nl sv

# GERMANIC = en de nl fy af da fo is no nb nn sv
GERMANIC = de nl fy af da fo is no nb nn sv
WESTGERMANIC = de nl af fy
SCANDINAVIAN = da fo is no nb nn sv
ROMANCE = ca es fr gl it la oc pt_br pt ro
FINNO_UGRIC = fi et hu
PIVOT = en

ifndef LANGS
  LANGS = ${MEMAD_LANGS}
endif


## run things with individual data sets only
%-fiskmo:
	${MAKE} TRAINSET=fiskmo ${@:-fiskmo=}

%-opensubtitles:
	${MAKE} TRAINSET=OpenSubtitles ${@:-opensubtitles=}

%-finlex:
	${MAKE} TRAINSET=Finlex ${@:-finlex=}


## a batch of interesting models ....

## germanic to germanic
germanic:
	${MAKE} LANGS="${GERMANIC}" HPC_DISK=1500 multilingual

scandinavian:
	${MAKE} LANGS="${SCANDINAVIAN}" multilingual-medium

memad2en:
	${MAKE} LANGS="${MEMAD_LANGS}" PIVOT=en all2pivot

fiet:
	${MAKE} SRCLANGS=fi TRGLANGS=et bilingual-medium

icelandic:
	${MAKE} SRCLANGS=is TRGLANGS=en bilingual
	${MAKE} SRCLANGS=is TRGLANGS="da no nn nb sv" bilingual
	${MAKE} SRCLANGS=is TRGLANGS=fi bilingual

enru-yandex:
	${MAKE} DATASET=opus+yandex SRCLANGS=ru TRGLANGS=en EXTRA_TRAINSET=yandex data
	${MAKE} DATASET=opus+yandex SRCLANGS=ru TRGLANGS=en EXTRA_TRAINSET=yandex reverse-data
	${MAKE} DATASET=opus+yandex SRCLANGS=en TRGLANGS=ru EXTRA_TRAINSET=yandex \
		WALLTIME=72 HPC_CORES=1 HPC_MEM=8g MARIAN_WORKSPACE=12000 train.submit-multigpu
	${MAKE} DATASET=opus+yandex SRCLANGS=ru TRGLANGS=en EXTRA_TRAINSET=yandex \
		WALLTIME=72 HPC_CORES=1 HPC_MEM=4g train.submit-multigpu


unidirectional:
	${MAKE} data
	${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train.submit-multigpu

bilingual:
	${MAKE} data
	${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
	${MAKE} reverse-data
	${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train.submit-multigpu

bilingual-medium:
	${MAKE} data
	${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
		MARIAN_VALID_FREQ=5000 MARIAN_WORKSPACE=10000 train.submit
	${MAKE} reverse-data
	${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
		MARIAN_VALID_FREQ=5000 MARIAN_WORKSPACE=10000 train.submit

bilingual-small:
	${MAKE} data
	${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
		MARIAN_WORKSPACE=5000 MARIAN_VALID_FREQ=2500 train.submit
	${MAKE} reverse-data
	${MAKE} WALLTIME=72 HPC_MEM=4g HPC_CORES=1 \
		MARIAN_WORKSPACE=5000 MARIAN_VALID_FREQ=2500 train.submit


multilingual:
	${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" data
	${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" \
		WALLTIME=72 HPC_CORES=1 HPC_MEM=4g train.submit-multigpu

multilingual-medium:
	${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" data
	${MAKE} SRCLANGS="${LANGS}" TRGLANGS="${LANGS}" \
		MARIAN_VALID_FREQ=5000 MARIAN_WORKSPACE=10000 \
		WALLTIME=72 HPC_CORES=1 HPC_MEM=4g train.submit-multigpu


all2pivot:
	for l in ${filter-out ${PIVOT},${LANGS}}; do \
	  ${MAKE} SRCLANGS="$$l" TRGLANGS="${PIVOT}" data; \
	  ${MAKE} SRCLANGS="$$l" TRGLANGS="${PIVOT}" HPC_CORES=1 HPC_MEM=4g train.submit-multigpu; \
	  ${MAKE} SRCLANGS="$$l" TRGLANGS="${PIVOT}" reverse-data; \
	  ${MAKE} SRCLANGS="${PIVOT}" TRGLANGS="$$l" HPC_CORES=1 HPC_MEM=4g train.submit-multigpu; \
	done

bilingual-dynamic:
	if [ ! -e "${WORKHOME}/${LANGSTR}/train.submit" ]; then \
	  ${MAKE} data; \
	  if [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 10000000 ]; then \
	    echo "${LANGSTR} bigger than 10 million"; \
	    ${MAKE} HPC_CORES=1 HPC_MEM=8g train.submit-multigpu; \
	    if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
	      ${MAKE} reverse-data-spm; \
	      ${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' HPC_CORES=1 HPC_MEM=8g train.submit-multigpu; \
	    fi; \
	  elif [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 1000000 ]; then \
	    echo "${LANGSTR} bigger than 1 million"; \
	    ${MAKE} \
	      MARIAN_VALID_FREQ=2500 \
	      HPC_CORES=1 HPC_MEM=4g train.submit; \
	    if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
	      ${MAKE} reverse-data-spm; \
	      ${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' \
	        MARIAN_VALID_FREQ=2500 \
	        HPC_CORES=1 HPC_MEM=4g train.submit; \
	    fi; \
	  elif [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 100000 ]; then \
	    echo "${LANGSTR} bigger than 100k"; \
	    ${MAKE} \
	      MARIAN_VALID_FREQ=1000 \
	      MARIAN_WORKSPACE=5000 \
	      MARIAN_VALID_MINI_BATCH=8 \
	      MARIAN_EARLY_STOPPING=5 \
	      HPC_CORES=1 HPC_MEM=4g train.submit; \
	    if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
	      ${MAKE} reverse-data-spm; \
	      ${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' \
	        MARIAN_VALID_FREQ=1000 \
	        MARIAN_WORKSPACE=5000 \
	        MARIAN_VALID_MINI_BATCH=8 \
	        MARIAN_EARLY_STOPPING=5 \
	        HPC_CORES=1 HPC_MEM=4g train.submit; \
	    fi; \
	  elif [ `zcat ${WORKHOME}/${LANGSTR}/train/*.src.clean.${PRE_SRC}.gz | wc -l` -gt 10000 ]; then \
	    echo "${LANGSTR} bigger than 10k"; \
	    ${MAKE} \
	      MARIAN_WORKSPACE=3500 \
	      MARIAN_VALID_MINI_BATCH=4 \
	      MARIAN_DROPOUT=0.5 \
	      MARIAN_VALID_FREQ=1000 \
	      MARIAN_EARLY_STOPPING=5 \
	      HPC_CORES=1 HPC_MEM=4g train.submit; \
	    if [ "${SRCLANGS}" != "${TRGLANGS}" ]; then \
	      ${MAKE} reverse-data-spm; \
	      ${MAKE} TRGLANGS="${SRCLANGS}" SRCLANGS='${TRGLANGS}' \
	        MARIAN_WORKSPACE=3500 \
	        MARIAN_VALID_MINI_BATCH=4 \
	        MARIAN_DROPOUT=0.5 \
	        MARIAN_VALID_FREQ=1000 \
	        MARIAN_EARLY_STOPPING=5 \
	        HPC_CORES=1 HPC_MEM=4g train.submit; \
	    fi; \
	  else \
	    echo "${LANGSTR} too small"; \
	  fi; \
	fi
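## bilingual-dynamic above picks training settings from the size of the
## training data (wc -l over the gzipped corpus). A stand-alone sketch of
## the same tiering, using the thresholds from the recipe (the function
## name and the echoed summaries are hypothetical shorthand for the
## corresponding make variable settings):

```shell
# map a corpus line count to the training configuration tier
select_settings () {
  lines=$1
  if   [ "$lines" -gt 10000000 ]; then echo "multigpu, 8g"
  elif [ "$lines" -gt 1000000  ]; then echo "single gpu, valid-freq 2500"
  elif [ "$lines" -gt 100000   ]; then echo "workspace 5000, early-stopping 5"
  elif [ "$lines" -gt 10000    ]; then echo "workspace 3500, dropout 0.5"
  else echo "too small"
  fi
}

select_settings 20000000
select_settings 50000
select_settings 5000
```

Note the recipe re-runs `zcat | wc -l` for each branch; in a shell
sketch the count is naturally computed once and reused.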


# iso639 = aa ab ae af ak am an ar as av ay az ba be bg bh bi bm bn bo br bs ca ce ch cn co cr cs cu cv cy da de dv dz ee el en eo es et eu fa ff fi fj fo fr fy ga gd gl gn gr gu gv ha hb he hi ho hr ht hu hy hz ia id ie ig ik io is it iu ja jp jv ka kg ki kj kk kl km kn ko kr ks ku kv kw ky la lb lg li ln lo lt lu lv me mg mh mi mk ml mn mo mr ms mt my na nb nd ne ng nl nn no nr nv ny oc oj om or os pa pi pl po ps pt qu rm rn ro ru rw ry sa sc sd se sg sh si sk sl sm sn so sq sr ss st su sv sw ta tc te tg th ti tk tl tn to tr ts tt tw ty ua ug uk ur uz ve vi vo wa wo xh yi yo za zh zu

# NO_MEMAD = ${filter-out fi sv de fr nl,${iso639}}

#"de_AT de_CH de_DE de"
#"en_AU en_CA en_GB en_NZ en_US en_ZA en"
#"it_IT it"
#"es_AR es_CL es_CO es_CR es_DO es_EC es_ES es_GT es_HN es_MX es_NI es_PA es_PE es_PR es_SV es_UY es_VE es"
#"eu_ES eu"
#"hi_IN hi"
#"fr_BE fr_CA fr_FR fr"
#"fa_AF fa_IR fa"
#"ar_SY ar_TN ar"
#"bn_IN bn"
#da_DK
#bg_BG
#nb_NO
#nl_BE nl_NL
#tr_TR
### ze_en - English subtitles in chinese movies

OPUSLANGS = fi sv fr es de ar he "cmn cn yue ze_zh zh_cn zh_CN zh_HK zh_tw zh_TW zh_yue zhs zht zh" "pt_br pt_BR pt_PT pt" aa ab ace ach acm acu ada ady aeb aed ae afb afh af agr aha aii ain ajg aka ake akl ak aln alt alz amh ami amu am ang an aoc aoz apc ara arc arh arn arq ary arz ase asf ast as ati atj avk av awa aym ay azb "az_IR az" bal bam ban bar bas ba bbc bbj bci bcl bem ber "be_tarask be" bfi bg bho bhw bh bin bi bjn bm bn bnt bo bpy brx br bsn bs btg bts btx bua bug bum bvl bvy bxr byn byv bzj bzs cab cac cak cat cay ca "cbk_zam cbk" cce cdo ceb ce chf chj chk cho chq chr chw chy ch cjk cjp cjy ckb ckt cku cmo cnh cni cop co "crh_latn crh" crp crs cr csb cse csf csg csl csn csr cs cto ctu cuk cu cv cycl cyo cy daf da dga dhv dik din diq dje djk dng dop dsb dtp dty dua dv dws dyu dz ecs ee efi egl el eml enm eo esn et eu ewo ext fan fat fa fcs ff fil fj fkv fon foo fo frm frp frr fse fsl fuc ful fur fuv fy gaa gag gan ga gbi gbm gcf gcr gd gil glk gl gn gom gor gos got grc gr gsg gsm gss gsw guc gug gum gur guw gu gv gxx gym hai hak hau haw ha haz hb hch hds hif hi hil him hmn hne hnj hoc ho hrx hr hsb hsh hsn ht hup hus hu hyw hy hz ia iba ibg ibo id ie ig ike ik ilo inh inl ins io iro ise ish iso is it iu izh jak jam jap ja jbo jdt jiv jmx jp jsl jv kaa kab kac kam kar kau ka kbd kbh kbp kea kek kg kha kik kin ki kjh kj kk kl kmb kmr km kn koi kok kon koo ko kpv kqn krc kri krl kr ksh kss ksw ks kum ku kvk kv kwn kwy kw kxi ky kzj lad lam la lbe lb ldn lez lfn lg lij lin liv li lkt lld lmo ln lou lo loz lrc lsp ltg lt lua lue lun luo lus luy lu lv lzh lzz mad mai mam map_bms mau max maz mco mcp mdf men me mfe mfs mgm mgr mg mhr mh mic min miq mi mk mlg ml mnc mni mnw mn moh mos mo mrj mrq mr "ms_MY ms" mt mus mvv mwl mww mxv myv my mzn mzy nah nan nap na nba "nb_NO nb nn_NO nn nog no_nb no" nch nci ncj ncs ncx ndc "nds_nl nds" nd new ne ngl ngt ngu ng nhg nhk nhn nia nij niu nlv nl nnh non nov npi nqo nrm nr nso nst nv nya nyk nyn nyu ny nzi oar oc ojb 
oj oke olo om orm orv or osx os ota ote otk pag pam pan pap pau pa pbb pcd pck pcm pdc pdt pes pfl pid pih pis pi plt pl pms pmy pnb pnt pon pot po ppk ppl prg prl prs pso psp psr ps pys quc que qug qus quw quy qu quz qvi qvz qya rap rar rcf rif rmn rms rmy rm rnd rn rom ro rsl rue run rup ru rw ry sah sat sa sbs scn sco sc sd seh se sfs sfw sgn sgs sg shi shn shs shy sh sid simple si sjn sk sl sma sml sm sna sn som son sop sot so sqk sq "sr_ME sr srp" srm srn ssp ss stq st sux su svk swa swc swg swh sw sxn syr szl "ta_LK ta" tcf tcy tc tdt tdx tet te "tg_TJ tg" thv th tig tir tiv ti tkl tk tlh tll "tl_PH tl" tly tmh tmp tmw tn tob tog toh toi toj toki top to tpi tpw trv tr tsc tss ts tsz ttj tt tum tvl tw tyv ty tzh tzl tzo udm ug uk umb urh "ur_PK ur" usp uz vec vep ve "vi_VN vi" vls vmw vo vro vsl wae wal war wa wba wes wls wlv wol wo wuu xal xho xh xmf xpe yao yap yaq ybb yi yor yo yua zab zai zam za zdj zea zib zlm zne zpa zpg zsl zsm "zul zu" zza


allopus2pivot:
	for l in ${filter-out ${PIVOT},${OPUSLANGS}}; do \
	  ${MAKE} WALLTIME=72 SRCLANGS="$$l" bilingual-dynamic; \
	done

## this looks dangerous ....
allopus:
	for s in ${OPUSLANGS}; do \
	  for t in ${OPUSLANGS}; do \
	    if [ ! -e "${WORKHOME}/$$s-$$t/train.submit" ]; then \
	      echo "${MAKE} WALLTIME=72 SRCLANGS=\"$$s\" TRGLANGS=\"$$t\" bilingual-dynamic"; \
	      ${MAKE} WALLTIME=72 SRCLANGS="$$s" TRGLANGS="$$t" bilingual-dynamic; \
	    fi; \
	  done; \
	done

all2en:
	${MAKE} PIVOT=en allopus2pivot


enit:
	${MAKE} SRCLANGS=en TRGLANGS=it traindata-spm
	${MAKE} SRCLANGS=en TRGLANGS=it devdata-spm
	${MAKE} SRCLANGS=en TRGLANGS=it wordalign-spm
	${MAKE} SRCLANGS=en TRGLANGS=it WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu


memad-fiensv:
	${MAKE} SRCLANGS=sv TRGLANGS=fi traindata-spm
	${MAKE} SRCLANGS=sv TRGLANGS=fi devdata-spm
	${MAKE} SRCLANGS=sv TRGLANGS=fi wordalign-spm
	${MAKE} SRCLANGS=sv TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
	${MAKE} SRCLANGS=sv TRGLANGS=fi reverse-data-spm
	${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
	${MAKE} SRCLANGS=en TRGLANGS=fi traindata-spm
	${MAKE} SRCLANGS=en TRGLANGS=fi devdata-spm
	${MAKE} SRCLANGS=en TRGLANGS=fi wordalign-spm
	${MAKE} SRCLANGS=en TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
	${MAKE} SRCLANGS=en TRGLANGS=fi reverse-data-spm
	${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu

memad250-fiensv:
	${MAKE} CONTEXT_SIZE=250 memad-fiensv_doc

memad-fiensv_doc:
	${MAKE} SRCLANGS=sv TRGLANGS=fi traindata-doc
	${MAKE} SRCLANGS=sv TRGLANGS=fi devdata-doc
	${MAKE} SRCLANGS=sv TRGLANGS=fi WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
	${MAKE} SRCLANGS=sv TRGLANGS=fi reverse-data-doc
	${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
	${MAKE} SRCLANGS=en TRGLANGS=fi traindata-doc
	${MAKE} SRCLANGS=en TRGLANGS=fi devdata-doc
	${MAKE} SRCLANGS=en TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-doc.submit-multigpu
	${MAKE} SRCLANGS=en TRGLANGS=fi reverse-data-doc
	${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-doc.submit-multigpu

memad-fiensv_more:
	${MAKE} SRCLANGS=sv TRGLANGS=fi traindata-doc
	${MAKE} SRCLANGS=sv TRGLANGS=fi devdata-doc
	${MAKE} SRCLANGS=sv TRGLANGS=fi WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
	${MAKE} SRCLANGS=sv TRGLANGS=fi reverse-data-doc
	${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
	${MAKE} CONTEXT_SIZE=500 memad-fiensv_doc


memad:
	for s in fi en sv de fr nl; do \
	  for t in en fi sv de fr nl; do \
	    if [ "$$s" != "$$t" ]; then \
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR.log}; then\
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata devdata wordalign; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g train.submit-multigpu; \
|
||||
fi \
|
||||
fi \
|
||||
done \
|
||||
done
|
||||
|
||||
|
||||
doclevel:
|
||||
${MAKE} ost-datasets
|
||||
${MAKE} traindata-doc-ost
|
||||
${MAKE} devdata-doc-ost
|
||||
${MAKE} wordalign-doc-ost
|
||||
${MAKE} CONTEXT_SIZE=${CONTEXT_SIZE} MODELTYPE=${MODELTYPE} \
|
||||
HPC_CORES=1 WALLTIME=72 HPC_MEM=4g train-doc-ost.submit
|
||||
|
||||
|
||||
fiensv_bpe:
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv traindata-bpe
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv devdata-bpe
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv wordalign-bpe
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-bpe.submit-multigpu
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en traindata-bpe
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en devdata-bpe
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en wordalign-bpe
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-bpe.submit-multigpu
|
||||
|
||||
|
||||
fiensv_spm:
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv traindata-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv devdata-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv wordalign-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=sv WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en traindata-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en devdata-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en wordalign-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=en WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
|
||||
|
||||
fifr_spm:
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi traindata-spm
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi devdata-spm
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi wordalign-spm
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi reverse-data-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=fr WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
|
||||
|
||||
fifr_doc:
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi traindata-doc
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi devdata-doc
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
|
||||
${MAKE} SRCLANGS=fr TRGLANGS=fi reverse-data-doc
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=fr WALLTIME=72 HPC_MEM=8g MARIAN_WORKSPACE=20000 HPC_CORES=1 train-doc.submit-multigpu
|
||||
|
||||
|
||||
fide_spm:
|
||||
${MAKE} SRCLANGS=de TRGLANGS=fi traindata-spm
|
||||
${MAKE} SRCLANGS=de TRGLANGS=fi devdata-spm
|
||||
${MAKE} SRCLANGS=de TRGLANGS=fi wordalign-spm
|
||||
${MAKE} SRCLANGS=de TRGLANGS=fi WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
|
||||
${MAKE} SRCLANGS=de TRGLANGS=fi reverse-data-spm
|
||||
${MAKE} SRCLANGS=fi TRGLANGS=de WALLTIME=72 HPC_MEM=4g HPC_CORES=1 train-spm.submit-multigpu
|
||||
|
||||
|
||||
|
||||
memad_spm:
|
||||
for s in fi en sv de fr nl; do \
|
||||
for t in en fi sv de fr nl; do \
|
||||
if [ "$$s" != "$$t" ]; then \
|
||||
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR.log}; then\
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata-spm; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t devdata-spm; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t wordalign-spm; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g train-spm.submit-multigpu; \
|
||||
fi \
|
||||
fi \
|
||||
done \
|
||||
done
|
||||
|
||||
|
||||
memad_doc:
|
||||
for s in fi en sv; do \
|
||||
for t in en fi sv; do \
|
||||
if [ "$$s" != "$$t" ]; then \
|
||||
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR.log}; then\
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata-doc; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t devdata-doc; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g MODELTYPE=transformer train-doc.submit-multigpu; \
|
||||
fi \
|
||||
fi \
|
||||
done \
|
||||
done
|
||||
|
||||
memad_docalign:
|
||||
for s in fi en sv; do \
|
||||
for t in en fi sv; do \
|
||||
if [ "$$s" != "$$t" ]; then \
|
||||
if ! grep -q 'stalled ${MARIAN_EARLY_STOPPING} times' ${WORKHOME}/$$s-$$t/*.valid${NR.log}; then\
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t traindata-doc; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t devdata-doc; \
|
||||
${MAKE} SRCLANGS=$$s TRGLANGS=$$t HPC_CORES=1 HPC_MEM=4g train-doc.submit-multigpu; \
|
||||
fi \
|
||||
fi \
|
||||
done \
|
||||
done
|
||||
|
||||
|
||||
|
||||
enfisv:
|
||||
${MAKE} SRCLANGS="en fi sv" TRGLANGS="en fi sv" traindata devdata wordalign
|
||||
${MAKE} SRCLANGS="en fi sv" TRGLANGS="en fi sv" HPC_MEM=4g WALLTIME=72 HPC_CORES=1 train.submit-multigpu
|
||||
|
||||
|
||||
|
||||
en-fiet:
|
||||
${MAKE} SRCLANGS="en" TRGLANGS="et fi" traindata devdata
|
||||
${MAKE} SRCLANGS="en" TRGLANGS="et fi" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
${MAKE} TRGLANGS="en" SRCLANGS="et fi" traindata devdata
|
||||
${MAKE} TRGLANGS="en" SRCLANGS="et fi" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
|
||||
|
||||
|
||||
memad-multi:
|
||||
for s in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="$$s" traindata devdata; \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="$$s" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
|
||||
done
|
||||
for s in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
|
||||
for t in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
|
||||
if [ "$$s" != "$$t" ]; then \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" traindata devdata; \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
|
||||
fi \
|
||||
done \
|
||||
done
|
||||
|
||||
memad-multi2:
|
||||
for s in "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
|
||||
for t in "${SCANDINAVIAN}" "en fr" "et hu fi" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
|
||||
if [ "$$s" != "$$t" ]; then \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" traindata devdata; \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="$$t" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
|
||||
fi \
|
||||
done \
|
||||
done
|
||||
|
||||
memad-multi3:
|
||||
for s in "${SCANDINAVIAN}" "${WESTGERMANIC}" "ca es fr ga it la oc pt_br pt"; do \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="en" traindata devdata; \
|
||||
${MAKE} SRCLANGS="$$s" TRGLANGS="en" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
|
||||
${MAKE} SRCLANGS="en" TRGLANGS="$$s" traindata devdata; \
|
||||
${MAKE} SRCLANGS="en" TRGLANGS="$$s" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
|
||||
done
|
||||
${MAKE} SRCLANGS="en" TRGLANGS="fr" traindata devdata
|
||||
${MAKE} SRCLANGS="en" TRGLANGS="fr" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
${MAKE} SRCLANGS="fr" TRGLANGS="en" traindata devdata
|
||||
${MAKE} SRCLANGS="fr" TRGLANGS="en" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
memad-fi:
|
||||
for l in en sv de fr; do \
|
||||
${MAKE} SRCLANGS=$$l TRGLANGS=fi traindata devdata; \
|
||||
${MAKE} SRCLANGS=$$l TRGLANGS=fi HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
|
||||
${MAKE} TRGLANGS=$$l SRCLANGS=fi traindata devdata; \
|
||||
${MAKE} TRGLANGS=$$l SRCLANGS=fi HPC_MEM=4g HPC_CORES=1 train.submit-multigpu; \
|
||||
done
|
||||
|
||||
|
||||
|
||||
nordic:
|
||||
${MAKE} SRCLANGS="${SCANDINAVIAN}" TRGLANGS="${FINNO_UGRIC}" traindata
|
||||
${MAKE} SRCLANGS="${SCANDINAVIAN}" TRGLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
${MAKE} TRGLANGS="${SCANDINAVIAN}" SRCLANGS="${FINNO_UGRIC}" traindata
|
||||
${MAKE} TRGLANGS="${SCANDINAVIAN}" SRCLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
|
||||
romance:
|
||||
${MAKE} SRCLANGS="${ROMANCE}" TRGLANGS="${FINNO_UGRIC}" traindata
|
||||
${MAKE} SRCLANGS="${ROMANCE}" TRGLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
${MAKE} TRGLANGS="${ROMANCE}" SRCLANGS="${FINNO_UGRIC}" traindata
|
||||
${MAKE} TRGLANGS="${ROMANCE}" SRCLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
|
||||
westgermanic:
|
||||
${MAKE} SRCLANGS="${WESTGERMANIC}" TRGLANGS="${FINNO_UGRIC}" traindata
|
||||
${MAKE} SRCLANGS="${WESTGERMANIC}" TRGLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
${MAKE} TRGLANGS="${WESTGERMANIC}" SRCLANGS="${FINNO_UGRIC}" traindata
|
||||
${MAKE} TRGLANGS="${WESTGERMANIC}" SRCLANGS="${FINNO_UGRIC}" HPC_MEM=4g HPC_CORES=1 train.submit-multigpu
|
||||
|
||||
|
||||
germanic-romance:
|
||||
${MAKE} SRCLANGS="${ROMANCE}" \
|
||||
TRGLANGS="${GERMANIC}" traindata
|
||||
${MAKE} HPC_MEM=4g HPC_CORES=1 SRCLANGS="${ROMANCE}" \
|
||||
TRGLANGS="${GERMANIC}" train.submit-multigpu
|
||||
${MAKE} TRGLANGS="${ROMANCE}" \
|
||||
SRCLANGS="${GERMANIC}" traindata devdata
|
||||
${MAKE} HPC_MEM=4g HPC_CORES=1 TRGLANGS="${ROMANCE}" \
|
||||
SRCLANGS="${GERMANIC}" train.submit-multigpu
|
||||
|
||||
|
54
README.md
Normal file
@ -0,0 +1,54 @@

# Train Opus-MT models

This folder includes make targets for training NMT models using MarianNMT and OPUS data. More details are given in the [Makefile](Makefile), but the documentation still needs improvement. Note that the targets require a specific environment and currently only work well on the CSC HPC cluster in Finland.

## Structure

Essential files for making new models:

* `Makefile`: top-level makefile
* `Makefile.env`: system-specific environment (currently based on CSC machines)
* `Makefile.config`: essential model configuration
* `Makefile.data`: data pre-processing tasks
* `Makefile.doclevel`: experimental document-level models
* `Makefile.tasks`: tasks for training specific models and other things (this changes frequently)
* `Makefile.dist`: make packages for distributing models (based on CSC ObjectStorage)
* `Makefile.slurm`: submit jobs with SLURM

Run this if you want to train a model, for example for translating English to French:

```
make SRCLANGS=en TRGLANGS=fr train
```

To evaluate the model with the automatically generated test data (from the Tatoeba corpus by default), run:

```
make SRCLANGS=en TRGLANGS=fr eval
```

For multilingual models (more than one language on either side), run, for example:

```
make SRCLANGS="de en" TRGLANGS="fr es pt" train
make SRCLANGS="de en" TRGLANGS="fr es pt" eval
```

Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed things up, you can process data sets in parallel using make's jobs flag, for example with 8 parallel jobs:

```
make -j 8 SRCLANGS=en TRGLANGS=fr data
```

## Upload to Object Storage

```
swift upload OPUS-MT --changed --skip-identical name-of-file
swift post OPUS-MT --read-acl ".r:*"
```

8
TODO.md
Normal file
@ -0,0 +1,8 @@

# Things to do

* add backtranslations to the training data
* can use monolingual data from tokenized Wikipedia dumps: https://sites.google.com/site/rmyeid/projects/polyglot
  * https://dumps.wikimedia.org/backup-index.html
  * better in JSON: https://dumps.wikimedia.org/other/cirrussearch/current/

115
large-context.pl
Executable file
@ -0,0 +1,115 @@

#!/usr/bin/env perl
#
# large-context.pl: concatenate sentence pairs (and their word alignments)
# into longer pseudo-documents of at most -l tokens per side (default 100).

use strict;
use warnings;

use vars qw($opt_l);
use Getopt::Std;
getopts('l:');

my $max = $opt_l || 100;

my $srcfile = shift(@ARGV);
my $trgfile = shift(@ARGV);
my $algfile = shift(@ARGV);

if ($srcfile=~/\.gz$/){
    open S,"gzip -cd <$srcfile |" or die "cannot open $srcfile";
}
else{ open S,"<$srcfile" or die "cannot open $srcfile"; }
if ($trgfile=~/\.gz$/){
    open T,"gzip -cd <$trgfile |" or die "cannot open $trgfile";
}
else{ open T,"<$trgfile" or die "cannot open $trgfile"; }
if ($algfile=~/\.gz$/){
    open A,"gzip -cd <$algfile |" or die "cannot open $algfile";
}
else{ open A,"<$algfile" or die "cannot open $algfile"; }

binmode(S,":utf8");
binmode(T,":utf8");
binmode(STDOUT,":utf8");

my $srcdoc = '<BEG> ';
my $trgdoc = '<BEG> ';
my $algdoc = '0-0';

my $srccount = 0;
my $trgcount = 0;
my $segcount = 0;

while (<S>){
    chomp;
    my $trg = <T>;
    my $alg = <A>;
    chomp($trg);
    chomp($alg);

    my @srctok = split(/\s+/);
    my @trgtok = split(/\s+/,$trg);

    ## flush the current document if adding this segment would exceed the length limit
    if ( ($srccount+@srctok > $max) || ($trgcount+@trgtok > $max) ){
	$srcdoc .= '<BRK>';
	$trgdoc .= '<BRK>';
	$algdoc .= ' ';
	$algdoc .= $srccount+$segcount+1;
	$algdoc .= '-';
	$algdoc .= $trgcount+$segcount+1;
	print $srcdoc,"\t",$trgdoc,"\t",$algdoc,"\n";
	$srcdoc = '<CNT> ';
	$trgdoc = '<CNT> ';
	$algdoc = '0-0';
	$srccount = 0;
	$trgcount = 0;
	$segcount = 0;
    }
    ## an empty line on both sides marks a document boundary
    if ( @srctok == 0 && @trgtok == 0 ){
	$srcdoc .= '<END>';
	$trgdoc .= '<END>';
	$algdoc .= ' ';
	$algdoc .= $srccount+$segcount+1;
	$algdoc .= '-';
	$algdoc .= $trgcount+$segcount+1;
	print $srcdoc,"\t",$trgdoc,"\t",$algdoc,"\n";
	$srcdoc = '<BEG> ';
	$trgdoc = '<BEG> ';
	$algdoc = '0-0';
	$srccount = 0;
	$trgcount = 0;
	$segcount = 0;
	next;
    }
    $srcdoc .= join(' ',@srctok);
    $trgdoc .= join(' ',@trgtok);
    $algdoc .= adjust_alignment($alg,$srccount,$trgcount,$segcount);
    $srcdoc .= ' <SEP> ';
    $trgdoc .= ' <SEP> ';
    $srccount += @srctok;
    $trgcount += @trgtok;
    $segcount++;
    $algdoc .= ' ';
    $algdoc .= $srccount+$segcount;
    $algdoc .= '-';
    $algdoc .= $trgcount+$segcount;
}

## print the final (possibly incomplete) document
if ($srcdoc || $trgdoc){
    print $srcdoc,"\t",$trgdoc,"\t",$algdoc,"\n";
}

## shift alignment links by the number of tokens (plus marker tokens)
## that precede the current segment in the document
sub adjust_alignment{
    my ($alg,$srccount,$trgcount,$segcount) = @_;
    my @links = split(/\s+/,$alg);
    my @newLinks = ();
    foreach my $l (@links){
	my ($s,$t) = split(/\-/,$l);
	$s += $srccount+$segcount+1;
	$t += $trgcount+$segcount+1;
	push(@newLinks,$s.'-'.$t);
    }
    return ' '.join(' ',@newLinks) if (@newLinks);
    return '';
}
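The offset arithmetic in `adjust_alignment` is the heart of the script: every `s-t` link is shifted by the number of tokens already emitted on each side, plus one marker token (`<BEG>`/`<SEP>`) per preceding segment. A minimal Python sketch of the same computation (an illustrative re-implementation, not part of the repository):

```python
def adjust_alignment(links: str, src_count: int, trg_count: int, seg_count: int) -> str:
    """Shift whitespace-separated 's-t' word-alignment links by the tokens
    already in the document plus one marker token per earlier segment."""
    shifted = []
    for link in links.split():
        s, t = link.split("-")
        shifted.append(f"{int(s) + src_count + seg_count + 1}-"
                       f"{int(t) + trg_count + seg_count + 1}")
    return " ".join(shifted)

# Example: third segment (seg_count=2) after 10 source / 12 target tokens
print(adjust_alignment("0-0 1-2", 10, 12, 2))  # prints "13-15 14-17"
```

The `+ 1` accounts for the single marker token (`<BEG>` or `<SEP>`) that opens the current segment.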
29
models/Makefile
Normal file
@ -0,0 +1,29 @@

MODELS = ${shell find . -type f -name '*.zip'}

## fix decoder.yml to match the typical setup
## and the names of the model and vocab in the zip file

fix-config:
	for m in ${MODELS}; do \
	  f=`unzip -l $$m | grep -oi '[^ ]*npz'`; \
	  v=`unzip -l $$m | grep -oi '[^ ]*vocab.yml'`; \
	  echo 'models:' > decoder.yml; \
	  echo " - $$f" >> decoder.yml; \
	  echo 'vocabs:' >> decoder.yml; \
	  echo " - $$v" >> decoder.yml; \
	  echo " - $$v" >> decoder.yml; \
	  echo 'beam-size: 6' >> decoder.yml; \
	  echo 'normalize: 1' >> decoder.yml; \
	  echo 'word-penalty: 0' >> decoder.yml; \
	  echo 'mini-batch: 1' >> decoder.yml; \
	  echo 'maxi-batch: 1' >> decoder.yml; \
	  echo 'maxi-batch-sort: src' >> decoder.yml; \
	  echo 'relative-paths: true' >> decoder.yml; \
	  zip $$m decoder.yml; \
	done
	rm -f decoder.yml

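For a model zip that contains, say, `opus.model1.npz` and `opus.vocab.yml` (hypothetical file names), the `fix-config` recipe above would generate a `decoder.yml` along these lines:

```yaml
models:
 - opus.model1.npz
vocabs:
 - opus.vocab.yml
 - opus.vocab.yml
beam-size: 6
normalize: 1
word-penalty: 0
mini-batch: 1
maxi-batch: 1
maxi-batch-sort: src
relative-paths: true
```

The vocabulary file is listed twice because the decoder expects one vocabulary per translation side, and these models use a single shared vocabulary for source and target.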
30
models/af-en/README.md
Normal file
@ -0,0 +1,30 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.af.en | 55.6 | 0.664 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.af.en | 60.8 | 0.736 |

15
models/af-fi/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.af.fi | 32.3 | 0.576 |

15
models/af-fr/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.af.fr | 35.3 | 0.543 |

15
models/af-sv/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.af.sv | 40.4 | 0.599 |

15
models/am-en/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/am-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.am.en | 23.5 | 0.492 |

15
models/am-sv/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/am-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/am-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.am.sv | 21.0 | 0.377 |

30
models/ar-en/README.md
Normal file
@ -0,0 +1,30 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ar.en | 46.7 | 0.620 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ar.en | 49.4 | 0.661 |

15
models/ar-fr/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ar-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ar.fr | 43.2 | 0.600 |

15
models/as-en/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/as-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/as-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/as-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.as.en | 89.3 | 0.901 |

15
models/az-en/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/az-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/az-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/az-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.az.en | 30.4 | 0.564 |

15
models/bcl-fi/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bcl.fi | 33.3 | 0.573 |

15
models/bcl-fr/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bcl.fr | 35.0 | 0.527 |

15
models/bcl-sv/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bcl-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bcl-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bcl.sv | 38.0 | 0.565 |

15
models/bem-en/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.en | 33.4 | 0.491 |

15
models/bem-fi/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.fi | 22.8 | 0.439 |

15
models/bem-fr/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.fr | 25.0 | 0.417 |

15
models/bem-sv/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bem-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bem-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.bem.sv | 25.6 | 0.434 |

15
models/ber-en/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ber-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ber.en | 37.3 | 0.566 |

15
models/ber-fr/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ber-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ber-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ber.fr | 60.2 | 0.754 |

30
models/bg-en/README.md
Normal file
@@ -0,0 +1,30 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bg.en         | 61.6  | 0.718 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bg.en         | 59.4  | 0.727 |
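The chr-F column in these benchmark tables is a character n-gram F-score; the published numbers come from the project's evaluation pipeline, not from the sketch below. As a rough illustration of the metric only, a simplified chrF (uniform average of character n-gram F2 scores up to order 6, whitespace stripped) can be written as:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # chrF works on characters; whitespace is commonly removed first.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF sketch: average F_beta over n-gram orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0

# A perfect match scores 1.0; disjoint strings score 0.0.
print(chrf("ett hus", "ett hus"))  # → 1.0
```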
15
models/bg-fi/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.bg.fi           | 23.7  | 0.505 |
15
models/bg-fr/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| GlobalVoices.bg.fr    | 20.9  | 0.480 |
15
models/bg-sv/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.bg.sv           | 29.1  | 0.494 |
30
models/bn-en/README.md
Normal file
@@ -0,0 +1,30 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bn.en         | 53.3  | 0.639 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bn.en         | 49.8  | 0.644 |
15
models/br-en/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/br-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/br-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/br-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.br.en         | 86.3  | 0.917 |
15
models/bs-en/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/bs-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bs-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bs-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.bs.en         | 33.3  | 0.536 |
15
models/bzs-en/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.en          | 44.5  | 0.605 |
15
models/bzs-fi/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.fi          | 24.7  | 0.464 |
15
models/bzs-fr/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.fr          | 30.0  | 0.479 |
15
models/bzs-sv/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/bzs-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bzs-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.bzs.sv          | 30.7  | 0.489 |
@@ -0,0 +1,30 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/ca+es+fr+ga+it+la+oc+pt_br+pt-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca+es+fr+ga+it+la+oc+pt_br+pt-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca+es+fr+ga+it+la+oc+pt_br+pt-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.es.fr | 29.6  | 0.561 |
| newssyscomb2009.fr.es | 30.3  | 0.561 |
| news-test2008.es.fr   | 27.9  | 0.538 |
| news-test2008.fr.es   | 28.6  | 0.538 |
| newstest2009.es.fr    | 26.3  | 0.537 |
| newstest2009.fr.es    | 27.1  | 0.537 |
| newstest2010.es.fr    | 30.2  | 0.563 |
| newstest2010.fr.es    | 30.9  | 0.563 |
| newstest2011.es.fr    | 29.3  | 0.552 |
| newstest2011.fr.es    | 30.1  | 0.552 |
| newstest2012.es.fr    | 29.5  | 0.553 |
| newstest2012.fr.es    | 30.2  | 0.553 |
| newstest2013.es.fr    | 27.6  | 0.536 |
| newstest2013.fr.es    | 28.4  | 0.536 |
| Tatoeba.ca.pt         | 50.7  | 0.659 |
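For multilingual models such as the one above, the card notes that input sentences need a sentence-initial `>>id<<` token selecting the target language. A minimal sketch of that formatting step (the helper name is illustrative, not part of the OPUS-MT tooling):

```python
def add_target_token(sentence: str, target_lang: str) -> str:
    """Prepend the >>id<< target-language token expected by multilingual
    OPUS-MT models. target_lang must be a language ID the model was
    trained to produce (e.g. 'pt' for the model above)."""
    return f">>{target_lang}<< {sentence}"

print(add_target_token("Bon dia!", "pt"))
# → >>pt<< Bon dia!
```

The token is consumed by the model itself, so it must survive whatever tokenization or subword segmentation is applied afterwards.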
15
models/ca-en/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ca.en         | 51.4  | 0.678 |
15
models/ca-fr/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ca-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.ca.fr         | 50.4  | 0.672 |
15
models/ceb-en/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.en          | 52.6  | 0.670 |
15
models/ceb-fi/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.fi          | 27.4  | 0.525 |
15
models/ceb-fr/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.fr          | 30.0  | 0.491 |
15
models/ceb-sv/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ceb-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ceb-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.ceb.sv          | 35.5  | 0.552 |
15
models/chk-en/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/chk-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.chk.en          | 31.2  | 0.465 |
15
models/chk-fr/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/chk-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.chk.fr          | 22.4  | 0.387 |
15
models/chk-sv/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/chk-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/chk-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.chk.sv          | 23.6  | 0.406 |
15
models/crs-en/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.crs.en          | 42.9  | 0.589 |
15
models/crs-fi/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.crs.fi          | 25.6  | 0.479 |
15
models/crs-fr/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.crs.fr          | 29.4  | 0.475 |
15
models/crs-sv/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/crs-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/crs-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.crs.sv          | 29.3  | 0.480 |
40
models/cs-en/README.md
Normal file
@@ -0,0 +1,40 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| newstest2014-csen.cs.en | 31.5 | 0.589 |
| newstest2015-encs.cs.en | 27.5 | 0.540 |
| newstest2016-encs.cs.en | 28.5 | 0.561 |
| newstest2017-encs.cs.en | 26.6 | 0.540 |
| newstest2018-encs.cs.en | 27.1 | 0.540 |
| Tatoeba.cs.en         | 62.5  | 0.743 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| newstest2014-csen.cs.en | 34.1 | 0.612 |
| newstest2015-encs.cs.en | 30.4 | 0.565 |
| newstest2016-encs.cs.en | 31.8 | 0.584 |
| newstest2017-encs.cs.en | 28.7 | 0.556 |
| newstest2018-encs.cs.en | 30.3 | 0.566 |
| Tatoeba.cs.en         | 58.0  | 0.721 |
15
models/cs-fi/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.cs.fi           | 25.5  | 0.523 |
15
models/cs-fr/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| GlobalVoices.cs.fr    | 21.0  | 0.488 |
15
models/cs-sv/README.md
Normal file
@@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.cs.sv           | 30.6  | 0.527 |
30
models/cy-en/README.md
Normal file
@@ -0,0 +1,30 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.cy.en         | 41.8  | 0.597 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.cy.en         | 33.0  | 0.525 |
@@ -0,0 +1,16 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-ca+es+fr+ga+it+la+oc+pt_br+pt/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.pt         | 52.1  | 0.684 |
32
models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/README.md
Normal file
@@ -0,0 +1,32 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.sv         | 70.7  | 0.824 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv/opus-2019-12-18.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.sv         | 69.2  | 0.811 |
16
models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/README.md
Normal file
@@ -0,0 +1,16 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+af+fy+nl/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.nl         | 51.6  | 0.690 |
16
models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/README.md
Normal file
@@ -0,0 +1,16 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-de+nl+af+fy/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fy         | 50.3  | 0.687 |
models/da+fo+is+no+nb+nn+sv-en+fr/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en+fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en+fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en+fr/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fr | 61.3 | 0.736 |
models/da+fo+is+no+nb+nn+sv-en/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.en | 61.3 | 0.743 |
models/da+fo+is+no+nb+nn+sv-et+fi/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+fi/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| fiskmo_testset.sv.fi | 22.4 | 0.590 |
| Tatoeba.da.fi | 40.4 | 0.637 |
models/da+fo+is+no+nb+nn+sv-et+hu+fi/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+hu+fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+hu+fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et+hu+fi/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| fiskmo_testset.sv.fi | 22.9 | 0.583 |
| Tatoeba.da.fi | 39.8 | 0.632 |
models/da+fo+is+no+nb+nn+sv-et/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-et/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.et | 37.8 | 0.592 |
models/da+fo+is+no+nb+nn+sv-fi/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fi/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| fiskmo_testset.sv.fi | 25.7 | 0.605 |
| Tatoeba.da.fi | 41.7 | 0.643 |
models/da+fo+is+no+nb+nn+sv-fr/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da+fo+is+no+nb+nn+sv-fr/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fr | 63.4 | 0.732 |
models/da-en/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.en | 65.1 | 0.774 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.en | 63.6 | 0.769 |
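Several model directories, like `models/da-en` above, ship more than one release, distinguished only by the `<dataset>-YYYY-MM-DD.zip` date stamp in the file name. Assuming that naming convention, the newest release can be picked programmatically (a small illustrative helper, not part of the repository):

```python
import re
from datetime import date

def latest_release(zip_names):
    """Return the zip file name carrying the most recent YYYY-MM-DD stamp."""
    def stamp(name):
        m = re.search(r"(\d{4})-(\d{2})-(\d{2})", name)
        if not m:
            raise ValueError(f"no release date found in {name!r}")
        return date(*map(int, m.groups()))
    return max(zip_names, key=stamp)

latest_release(["opus-2019-12-04.zip", "opus-2019-12-18.zip"])
# -> "opus-2019-12-18.zip"
```

Note that a newer date does not guarantee better scores: in `da-en` the 2019-12-18 SentencePiece model scores slightly below the 2019-12-04 BPE model on Tatoeba, so checking the benchmark tables remains worthwhile.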
models/da-fi/README.md

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fi | 39.0 | 0.629 |
models/da-fr/README.md

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.da.fr | 62.2 | 0.751 |
models/de+af+fy+nl-de+af+fy+nl/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-de+af+fy+nl/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-de+af+fy+nl/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-de+af+fy+nl/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.nl | 51.2 | 0.681 |
models/de+af+fy+nl-en/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 26.1 | 0.529 |
| news-test2008.de.en | 24.4 | 0.515 |
| newstest2009.de.en | 23.9 | 0.514 |
| newstest2010.de.en | 26.9 | 0.547 |
| newstest2011.de.en | 25.0 | 0.521 |
| newstest2012.de.en | 26.6 | 0.531 |
| newstest2013.de.en | 28.8 | 0.546 |
| newstest2014-deen.de.en | 28.3 | 0.547 |
| newstest2015-ende.de.en | 29.0 | 0.548 |
| newstest2016-ende.de.en | 34.0 | 0.595 |
| newstest2017-ende.de.en | 29.4 | 0.559 |
| newstest2018-ende.de.en | 36.5 | 0.605 |
| newstest2019-deen.de.en | 33.6 | 0.584 |
| Tatoeba.de.en | 42.0 | 0.631 |
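The chr-F column in these tables reports a character n-gram F-score, which rewards partial matches at the character level and is therefore gentler than BLEU on morphologically rich targets. A simplified, self-contained version of the metric (character n-grams up to order 6 with F2 weighting, as in the original chrF definition; real evaluations should use a standard implementation such as sacrebleu rather than this sketch):

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: mean character n-gram F-beta over n = 1..max_n."""
    def ngrams(text, n):
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        if not hyp or not ref:
            continue  # strings shorter than n contribute no n-grams
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0
```

Identical strings score 1.0 and disjoint strings score 0.0, matching the 0–1 scale used in the chr-F columns above.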
models/de+af+fy+nl-et+fi/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-et+fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-et+fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-et+fi/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fi | 38.8 | 0.612 |
models/de+af+fy+nl-fr/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+af+fy+nl-fr/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| euelections_dev2019.transformer.de | 22.2 | 0.530 |
| newssyscomb2009.de.fr | 18.6 | 0.500 |
| news-test2008.de.fr | 18.4 | 0.495 |
| newstest2009.de.fr | 18.2 | 0.486 |
| newstest2010.de.fr | 19.9 | 0.514 |
| newstest2011.de.fr | 18.8 | 0.493 |
| newstest2012.de.fr | 19.5 | 0.495 |
| newstest2013.de.fr | 20.1 | 0.498 |
| newstest2019-defr.de.fr | 24.3 | 0.554 |
| Tatoeba.de.fr | 29.4 | 0.570 |
models/de+nl+af+fy-de+nl+af+fy/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-de+nl+af+fy/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-de+nl+af+fy/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-de+nl+af+fy/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fy | 51.7 | 0.691 |
models/de+nl+af+fy-en/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+af+fy-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 25.6 | 0.525 |
| news-test2008.de.en | 24.2 | 0.515 |
| newstest2009.de.en | 23.7 | 0.513 |
| newstest2010.de.en | 26.8 | 0.548 |
| newstest2011.de.en | 24.9 | 0.522 |
| newstest2012.de.en | 26.3 | 0.529 |
| newstest2013.de.en | 28.8 | 0.546 |
| newstest2014-deen.de.en | 28.3 | 0.548 |
| newstest2015-ende.de.en | 28.9 | 0.549 |
| newstest2016-ende.de.en | 33.9 | 0.595 |
| newstest2017-ende.de.en | 29.4 | 0.558 |
| newstest2018-ende.de.en | 36.2 | 0.603 |
| newstest2019-deen.de.en | 33.8 | 0.585 |
| Tatoeba.de.en | 41.6 | 0.626 |
models/de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.sv | 48.1 | 0.663 |
models/de-en/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 26.9 | 0.535 |
| news-test2008.de.en | 24.8 | 0.519 |
| newstest2009.de.en | 24.6 | 0.519 |
| newstest2010.de.en | 27.5 | 0.552 |
| newstest2011.de.en | 25.6 | 0.526 |
| newstest2012.de.en | 27.1 | 0.535 |
| newstest2013.de.en | 29.4 | 0.551 |
| newstest2014-deen.de.en | 29.1 | 0.553 |
| newstest2015-ende.de.en | 29.7 | 0.556 |
| newstest2016-ende.de.en | 34.8 | 0.600 |
| newstest2017-ende.de.en | 29.9 | 0.563 |
| newstest2018-ende.de.en | 37.4 | 0.611 |
| newstest2019-deen.de.en | 34.3 | 0.587 |
| Tatoeba.de.en | 54.5 | 0.677 |

# opus-2019-12-18.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-18.zip)
* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-18.test.txt)
* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-en/opus-2019-12-18.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| newssyscomb2009.de.en | 28.6 | 0.553 |
| news-test2008.de.en | 27.6 | 0.547 |
| newstest2009.de.en | 26.9 | 0.544 |
| newstest2010.de.en | 30.4 | 0.585 |
| newstest2011.de.en | 27.5 | 0.554 |
| newstest2012.de.en | 29.0 | 0.567 |
| newstest2013.de.en | 32.2 | 0.583 |
| newstest2014-deen.de.en | 33.8 | 0.596 |
| newstest2015-ende.de.en | 34.3 | 0.598 |
| newstest2016-ende.de.en | 40.1 | 0.646 |
| newstest2017-ende.de.en | 35.6 | 0.609 |
| newstest2018-ende.de.en | 43.8 | 0.667 |
| newstest2019-deen.de.en | 39.6 | 0.637 |
| Tatoeba.de.en | 55.1 | 0.704 |
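The BLEU figures throughout these tables are corpus-level scores over the linked test sets: a geometric mean of modified n-gram precisions (usually up to 4-grams) multiplied by a brevity penalty. A compact illustrative implementation of that formula (the reported numbers come from the project's own evaluation pipeline, not this sketch, and it omits the smoothing a production scorer would apply):

```python
import math
from collections import Counter

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU on a 0-1 scale. hypotheses/references are parallel
    lists of token lists; returns 0.0 when any n-gram order has no match."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if 0 in totals or 0 in matches:
        return 0.0  # a zero precision would make the geometric mean undefined
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return brevity * math.exp(log_prec)
```

The tables report this quantity scaled to 0–100, so a `Tatoeba.de.en` entry of 55.1 corresponds to a score of 0.551 on this function's scale.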
models/de-fi/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fi | 40.1 | 0.624 |

# goethe-2019-11-15.zip

* dataset: opus+goethe
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [goethe-2019-11-15.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/goethe-2019-11-15.zip)
* info: trained on OPUS and fine-tuned for 6 epochs on data from the Goethe Institute

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| goethe.de.fi | 39.26 | |

# goethe-2020-01-07.zip

* dataset: opus+goethe
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [goethe-2020-01-07.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/goethe-2020-01-07.zip)
* info: trained on OPUS and fine-tuned for 3 epochs on data from the Goethe Institute without duplicates

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| goethe.de.fi | 38.57 | |

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.fi | 40.0 | 0.628 |
models/de-fr/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| euelections_dev2019.transformer.de | 22.5 | 0.531 |
| newssyscomb2009.de.fr | 18.8 | 0.500 |
| news-test2008.de.fr | 18.4 | 0.494 |
| newstest2009.de.fr | 18.4 | 0.487 |
| newstest2010.de.fr | 20.2 | 0.517 |
| newstest2011.de.fr | 18.9 | 0.494 |
| newstest2012.de.fr | 19.6 | 0.497 |
| newstest2013.de.fr | 20.4 | 0.502 |
| newstest2019-defr.de.fr | 24.5 | 0.557 |
| Tatoeba.de.fr | 54.8 | 0.666 |

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| euelections_dev2019.transformer-align.de | 32.2 | 0.590 |
| newssyscomb2009.de.fr | 26.8 | 0.553 |
| news-test2008.de.fr | 26.4 | 0.548 |
| newstest2009.de.fr | 25.6 | 0.539 |
| newstest2010.de.fr | 29.1 | 0.572 |
| newstest2011.de.fr | 26.9 | 0.551 |
| newstest2012.de.fr | 27.7 | 0.554 |
| newstest2013.de.fr | 29.5 | 0.560 |
| newstest2019-defr.de.fr | 36.6 | 0.625 |
| Tatoeba.de.fr | 49.2 | 0.664 |
models/de-nl/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.nl | 52.6 | 0.697 |
models/de-sv/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-sv/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.de.sv | 55.0 | 0.699 |
models/ee-en/README.md

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ee-en/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-en/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-en/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ee.en | 39.3 | 0.556 |
| Tatoeba.ee.en | 21.2 | 0.569 |
models/ee-fr/README.md

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ee-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ee.fr | 27.1 | 0.450 |
models/ee-sv/README.md

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/ee-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ee-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.ee.sv | 28.9 | 0.472 |
models/efi-fi/README.md

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/efi-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.efi.fi | 23.6 | 0.450 |
models/efi-sv/README.md

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/efi-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/efi-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| JW300.efi.sv | 26.8 | 0.447 |
models/el-en/README.md

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-en/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-en/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-en/opus-2019-12-04.eval.txt)

## Benchmarks

| testset | BLEU | chr-F |
|-----------------------|-------|-------|
| Tatoeba.el.en | 69.4 | 0.801 |
15
models/el-fi/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-fi/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fi/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fi/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| JW300.el.fi           | 25.3  | 0.517 |

15
models/el-fr/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-fr/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fr/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-fr/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.el.fr         | 63.0  | 0.741 |

15
models/el-sv/README.md
Normal file
@ -0,0 +1,15 @@

# opus-2020-01-08.zip

* dataset: opus
* model: transformer-align
* pre-processing: normalization + SentencePiece
* download: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/el-sv/opus-2020-01-08.zip)
* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-sv/opus-2020-01-08.test.txt)
* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/el-sv/opus-2020-01-08.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| GlobalVoices.el.sv    | 23.6  | 0.498 |

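Every benchmark table in these cards reports BLEU and chr-F. chr-F is an F-score over character n-grams of the hypothesis and reference. The following is a simplified sketch of the idea (whitespace stripped, n-grams of order 1..6, β = 2, averaged over orders), assuming the common defaults; it is not the official implementation used to produce these scores, which handles details such as word n-gram options differently.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-grams of a string, with whitespace removed."""
    s = "".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chr-F: average character n-gram F-beta over orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this order
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0

print(chrf("the cat sat", "the cat sat"))  # identical strings score 1.0
```

β = 2 weights recall twice as heavily as precision, which is why a translation that drops content is penalized more than one that adds a little.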
41
models/en+de+nl+fy+af+da+fo+is+no+nb+nn+sv-en+de+nl+fy+af+da+fo+is+no+nb+nn+sv/README.md
Normal file
@ -0,0 +1,41 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/en+de+nl+fy+af+da+fo+is+no+nb+nn+sv-en+de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+de+nl+fy+af+da+fo+is+no+nb+nn+sv-en+de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+de+nl+fy+af+da+fo+is+no+nb+nn+sv-en+de+nl+fy+af+da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)

## Benchmarks

| testset                  | BLEU  | chr-F |
|--------------------------|-------|-------|
| newssyscomb2009.de.en    | 19.9  | 0.505 |
| newssyscomb2009.en.de    | 19.9  | 0.505 |
| news-test2008.de.en      | 20.1  | 0.494 |
| news-test2008.en.de      | 20.1  | 0.494 |
| newstest2009.de.en       | 19.1  | 0.496 |
| newstest2009.en.de       | 19.1  | 0.496 |
| newstest2010.de.en       | 21.0  | 0.506 |
| newstest2010.en.de       | 21.0  | 0.506 |
| newstest2011.de.en       | 19.3  | 0.486 |
| newstest2011.en.de       | 19.3  | 0.486 |
| newstest2012.de.en       | 19.6  | 0.487 |
| newstest2012.en.de       | 19.6  | 0.487 |
| newstest2013.de.en       | 23.0  | 0.512 |
| newstest2013.en.de       | 23.0  | 0.512 |
| newstest2014-deen.de.en  | 26.3  | 0.535 |
| newstest2015-ende.de.en  | 26.6  | 0.540 |
| newstest2015-ende.en.de  | 26.6  | 0.540 |
| newstest2016-ende.de.en  | 29.1  | 0.569 |
| newstest2016-ende.en.de  | 29.1  | 0.569 |
| newstest2017-ende.de.en  | 24.5  | 0.528 |
| newstest2017-ende.en.de  | 24.5  | 0.528 |
| newstest2018-ende.de.en  | 34.9  | 0.604 |
| newstest2018-ende.en.de  | 34.9  | 0.604 |
| newstest2019-deen.de.en  | 31.3  | 0.569 |
| newstest2019-ende.en.de  | 32.2  | 0.578 |
| Tatoeba.en.sv            | 43.3  | 0.630 |

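For the multilingual models above, the `>>id<<` target-language token is simply a plain-text prefix on the source sentence, added before translation. A minimal sketch of that preprocessing step (the helper name is illustrative, not part of OPUS-MT):

```python
def with_target_token(sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with the sentence-initial >>id<< token that
    multilingual OPUS-MT models use to select the target language."""
    return f">>{target_lang}<< {sentence}"

# Ask the en+de+nl+... model to translate into Swedish:
print(with_target_token("How are you?", "sv"))  # → ">>sv<< How are you?"
```

Without the token, a multilingual model has no way to know which of its target languages is wanted, so output quality degrades unpredictably.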
16
models/en+fr-da+fo+is+no+nb+nn+sv/README.md
Normal file
16
models/en+fr-da+fo+is+no+nb+nn+sv/README.md
Normal file
@ -0,0 +1,16 @@

# opus-2019-12-04.zip

* dataset: opus
* model: transformer
* pre-processing: normalization + tokenization + BPE
* a sentence-initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/en+fr-da+fo+is+no+nb+nn+sv/opus-2019-12-04.zip)
* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+fr-da+fo+is+no+nb+nn+sv/opus-2019-12-04.test.txt)
* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en+fr-da+fo+is+no+nb+nn+sv/opus-2019-12-04.eval.txt)

## Benchmarks

| testset               | BLEU  | chr-F |
|-----------------------|-------|-------|
| Tatoeba.en.sv         | 53.0  | 0.685 |

Some files were not shown because too many files have changed in this diff.