mirror of https://github.com/marian-nmt/marian.git synced 2024-12-13 05:41:06 +03:00

Fast Neural Machine Translation in C++

AmuNMT

A C++ decoder for Neural Machine Translation (NMT) models trained with Theano-based scripts from Nematus (https://github.com/rsennrich/nematus) or DL4MT (https://github.com/nyu-dl/dl4mt-tutorial)

We aim to keep compatibility with Nematus (at least as long as there is no training framework in AmuNMT). Continued compatibility with DL4MT is not guaranteed.

Requirements:

CMake 3.5.1 (due to CUDA related bugs in earlier versions)
Boost 1.5
CUDA 7.5 (8.0 recommended)

Optional

KenLM for n-gram language models (https://github.com/kpu/kenlm, current master)

Compilation

The project is a standard CMake out-of-source build:

mkdir build
cd build
cmake ..
make -j

Or with KenLM support:

cmake .. -DKENLM=path/to/kenlm

On Ubuntu 16.04, you currently need g++4.9 and cuda-7.5 to compile. This also requires a custom Boost build compiled with g++4.9 instead of the standard g++5.3, since the binaries are not compatible.

CUDA_BIN_PATH=/usr/local/cuda-7.5 BOOST_ROOT=/path/to/custom/boost cmake .. \
-DCMAKE_CXX_COMPILER=g++-4.9 -DCUDA_HOST_COMPILER=/usr/bin/g++-4.9

With cuda-8.0 (RC) it is possible to use g++5 (but not g++6), which makes most of the above tricks obsolete.

Vocabulary files

Vocabulary files (and all other config files) in AmuNMT are by default YAML files. AmuNMT also reads gzipped yml.gz files.

Vocabulary files from models trained with Nematus can be used directly as JSON is a proper subset of YAML.
Vocabularies for models trained with DL4MT (*.pkl extension) need to be converted to JSON/YAML with either of the two scripts below:

python scripts/pkl2json.py vocab.en.pkl > vocab.json
python scripts/pkl2yaml.py vocab.en.pkl > vocab.yml
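
To illustrate what such a conversion involves, here is a minimal Python sketch. It is not the code of scripts/pkl2json.py. It assumes the pickle holds a plain word-to-id mapping, and the function name pkl_vocab_to_json is hypothetical. Pickles written by Python 2 may need extra handling (for example an encoding argument to pickle.load) that is not shown here.

```python
import json
import pickle


def pkl_vocab_to_json(pkl_path, json_path):
    """Convert a pickled word -> id mapping into a JSON vocabulary file.

    Hypothetical helper for illustration; assumes the pickle contains a
    dict-like object mapping words to integer ids.
    """
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(pkl_path, "rb") as f:
        vocab = pickle.load(f)

    # Normalise to plain str keys and int values so the result is valid JSON
    # (and therefore also valid YAML).
    vocab = {str(word): int(idx) for word, idx in vocab.items()}

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    return vocab
```

Because the output is JSON, it can be used wherever a YAML vocabulary is expected.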

Running AmuNMT

./bin/amun -c config.yml <<< "This is a test ."

Configuration files

An example configuration:

# Paths are relative to config file location
relative-paths: yes

# performance settings
beam-size: 12
devices: [0]
normalize: yes
threads-per-device: 1

# scorer configuration
scorers: 
  F0:
    path: model.en-de.npz 
    type: Nematus

# scorer weights
weights: 
  F0: 1.0

# vocabularies
source-vocab: vocab.en.yml.gz
target-vocab: vocab.de.yml.gz
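
In the example above, each entry under scorers has a matching entry under weights with the same name (F0). A quick consistency check of that pairing could look like the following Python sketch. It assumes, without confirmation from this README, that every scorer should have exactly one weight. The function check_scorer_weights is hypothetical, and the configuration is written out as the dictionary a YAML parser would be expected to produce, so no YAML library is needed.

```python
def check_scorer_weights(config):
    """Return a list of mismatches between 'scorers' and 'weights'.

    Hypothetical helper; assumes each scorer name should appear in both
    sections of the configuration.
    """
    scorers = config.get("scorers") or {}
    weights = config.get("weights") or {}
    problems = []
    # Scorers that were configured but never given a weight.
    for name in sorted(set(scorers) - set(weights)):
        problems.append(f"scorer '{name}' has no weight")
    # Weights that refer to a scorer that does not exist.
    for name in sorted(set(weights) - set(scorers)):
        problems.append(f"weight '{name}' has no scorer")
    return problems


# The example configuration, as a parsed dictionary (YAML 'yes' read as True).
config = {
    "relative-paths": True,
    "beam-size": 12,
    "devices": [0],
    "normalize": True,
    "threads-per-device": 1,
    "scorers": {"F0": {"path": "model.en-de.npz", "type": "Nematus"}},
    "weights": {"F0": 1.0},
    "source-vocab": "vocab.en.yml.gz",
    "target-vocab": "vocab.de.yml.gz",
}

print(check_scorer_weights(config))  # prints []
```

An empty list means the names in both sections line up; any mismatch is reported by name.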

Example usage

Data and systems for our winning system in the WMT 2016 Shared Task on Automatic Post-Editing