mirror of https://github.com/google/sentencepiece.git
synced 2024-10-26 11:38:45 +03:00

added initial support to compile on OSX

This commit is contained in:
parent 777e7133a1
commit f55d4a88db
.gitignore (vendored, new file, 49 lines)

@@ -0,0 +1,49 @@
Makefile
Makefile.in
/ar-lib
/mdate-sh
/py-compile
/test-driver
/ylwrap

/autom4te.cache
/autoscan.log
/autoscan-*.log
/aclocal.m4
/compile
/config.guess
/config.h.in
/config.sub
/configure
/configure.scan
/depcomp
/install-sh
/missing
/stamp-h1
/libtool
/config.h
/config.status
/autogen.sh
/ltmain.sh

*.o
*.lo
*.a
*.la

.libs
.deps

*.m4
*.log

compile_charsmap

spm_decode
spm_encode
spm_export_vocab
spm_train
spm_normalize

*.pb.cc
*.pb.h
|
README.md (11 changed lines)

@@ -80,6 +80,11 @@ On Ubuntu, autotools and protobuf library can be install with apt-get:
```
(If `libprotobuf9v5` is not found, try `libprotobuf-c++` instead.)

+On OSX, you can use brew:
+```
+% brew install protobuf
+```
+
## Build and Install SentencePiece
```
% cd /path/to/sentencepiece
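The install steps above assume the protobuf toolchain ends up on PATH, whether it comes from apt-get or brew. As an illustrative pre-build sanity check (not part of the README), you can verify the protobuf compiler is visible before running the autotools build:

```shell
#!/bin/sh
# Illustrative check, not from the README: confirm the protobuf compiler
# installed via apt-get (Ubuntu) or brew (OSX) is actually on PATH.
if command -v protoc >/dev/null 2>&1; then
  status="protoc found: $(protoc --version)"
else
  status="protoc missing: install protobuf first"
fi
echo "$status"
```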
@@ -131,7 +136,7 @@ Use `--extra_options` flag to decode the text in reverse order.
## End-to-End Example
```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
@@ -167,7 +172,7 @@ You can find that the original input sentence is restored from the vocabulary id
* **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
* **(Moses/KyTea)+SentencePiece**: Apply SentencePiece (Unigram) to pre-tokenized sentences. We have several variants with different tokenizers., e.g., **(Moses/MeCab)+SentencePiece**, **(MeCab/Moses)+SentencePiece**.
* *char**: Segments sentence by characters.

* Data sets:
  * [KFTT](http://www.phontron.com/kftt/index.html)

@@ -180,7 +185,7 @@ You can find that the original input sentence is restored from the vocabulary id
* Evaluation metrics:
  * Case-sensitive BLEU on detokenized text with NIST scorer and KyTea segmenter. Used in-house rule-based detokenizer for Moses/KyTea/MeCab/neologd.


### Results (BLEU scores)
#### English to Japanese
|Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
autogen.sh

@@ -18,7 +18,9 @@ aclocal -I .
echo "Running autoheader..."
autoheader
echo "Running libtoolize .."
-libtoolize
+case `uname` in Darwin*) glibtoolize ;;
+             *) libtoolize ;;
+esac
echo "Running automake ..."
automake --add-missing --copy
echo "Running autoconf ..."
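The hunk above dispatches on `uname` because Homebrew installs GNU libtool with a `g` prefix on macOS, so the bootstrap tool is named `glibtoolize` there rather than `libtoolize`. A minimal standalone sketch of the same platform check (illustrative, not the exact commit script):

```shell
#!/bin/sh
# Pick the libtoolize binary by platform: on macOS (uname prints "Darwin"),
# Homebrew's GNU libtool is installed as "glibtoolize"; elsewhere the plain
# "libtoolize" name applies. Mirrors the case statement in the hunk above.
case "$(uname)" in
  Darwin*) LIBTOOLIZE=glibtoolize ;;
  *)       LIBTOOLIZE=libtoolize ;;
esac
echo "selected: $LIBTOOLIZE"
```

The same pattern generalizes to other prefixed GNU tools on macOS (e.g. GNU sed installed as `gsed`).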