Commit Graph

178 Commits

Author SHA1 Message Date
Joerg Tiedemann
c6356d3a8a back to yml vocab files as default 2020-09-25 09:58:25 +03:00
Traubert
9ee38fe355 Suppress warnings when testing for missing Perl modules 2020-09-24 13:25:14 +03:00
Traubert
ec69daa989 Some puhti modules needed for installing 2020-09-24 13:18:18 +03:00
Traubert
1e349f269c Fix type 2020-09-24 12:15:08 +03:00
Joerg Tiedemann
f9a44bdb99 merged 2020-09-18 23:08:56 +03:00
Joerg Tiedemann
87a5354de5 changes to tatoeba recipes 2020-09-18 23:05:46 +03:00
Jörg Tiedemann
a61cf48443 add option to skip sentence piece vocabs but use marian_vocab instead 2020-09-16 19:33:19 +03:00
tiedemann
913d31472e Merge pull request #25 from Helsinki-NLP/sam-fixes (Fix typo) 2020-09-16 09:28:46 +03:00
Joerg Tiedemann
c564bd1f56 fix in fetching data for Sami languages 2020-09-16 09:25:36 +03:00
Traubert
1da54c4155 Fix typo 2020-09-15 15:02:20 +03:00
Jörg Tiedemann
58fbf0bdd8 back to old subword model names 2020-09-14 08:53:57 +03:00
Jörg Tiedemann
c2798e9758 plain text vocab files from spm models 2020-09-13 22:17:21 +03:00
Jörg Tiedemann
24e92de56a proper release packages for models with internal sentence piece vocabs 2020-09-13 00:00:15 +03:00
Jörg Tiedemann
666b2b8462 internal sentence piece models in transformers 2020-09-12 16:16:01 +03:00
Jörg Tiedemann
ddafb43d66 removed dependence on moses tools in preprocessing script for released spm packages 2020-09-12 14:42:10 +03:00
Jörg Tiedemann
c0cb356417 added acknowledgements 2020-09-12 12:01:02 +03:00
Jörg Tiedemann
16eef8e45d moved project makefiles to lib/projects 2020-09-10 12:12:44 +03:00
Jörg Tiedemann
1a6e29275d dev data is now uniq to avoid overlaps with test data 2020-09-09 23:21:07 +03:00
Jörg Tiedemann
3735af4ec1 more documentation 2020-09-07 23:00:01 +03:00
Jörg Tiedemann
a47c292152 pivoting and documentation 2020-09-05 22:19:00 +03:00
Jörg Tiedemann
ad828c3124 started tutorial and fixes to backtranslate makefile 2020-09-05 00:16:22 +03:00
Tiedemann Jörg
d11f74ce41 added bpe submodule 2020-09-04 15:34:20 +03:00
Tiedemann
96eaad2d05 added possibility to fetch moses file from ObjectStore (instead of reading with opus_read) 2020-09-03 22:04:44 +03:00
Joerg Tiedemann
971ece9606 fix tatoeba data labels 2020-09-03 07:55:44 +03:00
Tiedemann
1435b7849a moved allas recipes to a different makefile 2020-09-02 16:35:35 +03:00
Tiedemann
2332732577 make compatible with mac osx and include submodules for required tools 2020-09-02 15:52:34 +03:00
Joerg Tiedemann
639bd2adda started documentation of project specific models 2020-08-28 15:51:37 +03:00
Joerg Tiedemann
2c04e48dbe fixed an important bug in data merging 2020-08-28 11:52:46 +03:00
Joerg Tiedemann
e31550a3ad enabled fetching OPUS data instead of reading local files if necessary 2020-08-28 10:53:11 +03:00
Joerg Tiedemann
94eeec13eb take away dependence on local OPUS files for finding data 2020-08-27 22:36:50 +03:00
Joerg Tiedemann
831ee89f76 fixed bug in env.mk 2020-08-26 22:18:12 +03:00
Joerg Tiedemann
596dd993a5 more documentation 2020-08-26 21:45:03 +03:00
Joerg Tiedemann
a8b54f5311 some info about training added 2020-08-26 15:12:38 +03:00
Joerg Tiedemann
2f8a37cc92 more details about data compilation added 2020-08-26 14:31:50 +03:00
Joerg Tiedemann
dac6070069 started some more documentation 2020-08-26 09:59:24 +03:00
Joerg Tiedemann
f2a413b740 minor cleanup in env 2020-08-26 01:01:44 +03:00
Joerg Tiedemann
4c35456038 cleanup in data makefile 2020-08-26 00:44:02 +03:00
Joerg Tiedemann
9375f37886 missing makefile added 2020-08-25 22:42:33 +03:00
Joerg Tiedemann
0e27198048 store and fetch work data 2020-08-22 23:51:37 +03:00
Joerg Tiedemann
d7252e32b7 tatoeba monolingual data 2020-08-05 00:00:24 +03:00
Joerg Tiedemann
6bf0207cc6 list of models added 2020-08-03 11:58:51 +03:00
Joerg Tiedemann
c9fcb7f35d tatoeba langgroup models 2020-08-02 11:38:42 +03:00
Joerg Tiedemann
1b913277b3 tatoeba language group models with various sample sizes 2020-07-25 22:52:33 +03:00
Joerg Tiedemann
5493aeddb4 fixed a problem with lang group targets 2020-07-14 21:40:49 +03:00
Joerg Tiedemann
068b82cc1d fixed a bug in eval/dist groups 2020-07-14 13:29:06 +03:00
Joerg Tiedemann
e2edc4195a result tables for language groups and minor fixes for start scripts in Tatoeba challenge 2020-07-10 11:59:37 +03:00
Joerg Tiedemann
ec6d7c7142 tatoeba langgroups 2020-07-04 23:37:39 +03:00
Joerg Tiedemann
7df91a9eaa language group jobs with some more documentation 2020-06-29 12:26:45 +03:00
Joerg Tiedemann
62c9414122 lang groups 2020-06-29 00:15:35 +03:00
Joerg Tiedemann
46a0b2b15a fixed dist-packaging 2020-06-27 13:56:51 +03:00
Joerg Tiedemann
e2bc2acb3b re-organised targets for multilingual models of language groups 2020-06-27 12:29:50 +03:00
Joerg Tiedemann
9e186d82d6 bugfix in tatoeba data extraction for multilingual data files (language code clash) 2020-06-25 00:45:25 +03:00
Joerg Tiedemann
844f8bf72a removed unnecessary pre-processing for chinese 2020-06-19 16:12:06 +03:00
Joerg Tiedemann
b7f45e2a74 more details in model config 2020-06-18 20:50:22 +03:00
Joerg Tiedemann
4e18da6e4c fix chinese/korean/japanese language codes 2020-06-17 22:02:39 +03:00
Joerg Tiedemann
e141772b34 fixed multilingual tatoeba evaluation 2020-06-11 00:54:40 +03:00
Joerg Tiedemann
cc16be10d4 final fixes to multilingual tatoeba model scripts 2020-06-09 11:19:58 +03:00
Joerg Tiedemann
b7691875c2 tatoeba models now operational 2020-06-09 00:12:16 +03:00
Joerg Tiedemann
035cca7c1a fixed tatoeba model scripts 2020-06-08 17:24:39 +03:00
Joerg Tiedemann
e07eb14984 fit-data-size fixed 2020-06-08 14:14:55 +03:00
Joerg Tiedemann
6cb9959e82 tatoeba challenge model scripts updated 2020-06-06 20:49:54 +03:00
Joerg Tiedemann
edaf361803 multilingual tatoeba models and some documentation added 2020-06-03 15:39:18 +03:00
Joerg Tiedemann
c44e92d52a fixed bug in tatoeba model call 2020-06-03 01:09:28 +03:00
Joerg Tiedemann
eeaef7768c tatoeba models added 2020-06-03 00:16:21 +03:00
Joerg Tiedemann
ec43fcd30a fixed a bug in eval-testsets 2020-05-29 14:43:36 +03:00
Joerg Tiedemann
d0a217cf40 wikimatrix models added 2020-05-21 20:51:38 +03:00
Joerg Tiedemann
716d7b52c1 fixed testset names and backtranslation sentence splitting 2020-05-20 23:19:48 +03:00
Joerg Tiedemann
04d72ff8ed fixes with pivoting 2020-05-18 21:36:53 +03:00
Joerg Tiedemann
b01b4f22c3 pivot-based translations added 2020-05-17 22:43:05 +03:00
Joerg Tiedemann
1246bcd271 added some size info to train data README 2020-05-17 01:21:57 +03:00
Joerg Tiedemann
37a83a9eba information about license for pre-trained models added 2020-05-15 20:01:07 +03:00
Joerg Tiedemann
cb3b77573e make it possible to exclude certain data sets 2020-05-14 10:36:46 +03:00
Joerg Tiedemann
7ef908dcd7 translate with backtranslations 2020-05-13 00:41:07 +03:00
Joerg Tiedemann
e4455e510a a bit more info added for data sets 2020-05-09 22:33:33 +03:00
Joerg Tiedemann
d4b71e0261 fixed includes in backtranslate/evaluate/finetune makefiles 2020-05-07 22:51:31 +03:00
Joerg Tiedemann
c703bb4c2b fixed file name for wikimedia.mk and added memad-multi model 2020-05-07 19:55:28 +03:00
Joerg Tiedemann
5404f515aa new makefile structure 2020-05-03 21:46:30 +03:00
Joerg Tiedemann
6b8e69269a better division of the massive tasks makefile 2020-05-03 20:27:55 +03:00