mirror of https://github.com/google/sentencepiece.git (synced 2024-08-16 06:10:47 +03:00)

added initial support to compile on OSX

This commit is contained in: parent 777e7133a1, commit f55d4a88db
.gitignore (vendored, new file, 49 lines)
```diff
@@ -0,0 +1,49 @@
+Makefile
+Makefile.in
+/ar-lib
+/mdate-sh
+/py-compile
+/test-driver
+/ylwrap
+
+/autom4te.cache
+/autoscan.log
+/autoscan-*.log
+/aclocal.m4
+/compile
+/config.guess
+/config.h.in
+/config.sub
+/configure
+/configure.scan
+/depcomp
+/install-sh
+/missing
+/stamp-h1
+/libtool
+/config.h
+/config.status
+/autogen.sh
+/ltmain.sh
+
+*.o
+*.lo
+*.a
+*.la
+
+.libs
+.deps
+
+*.m4
+*.log
+
+compile_charsmap
+
+spm_decode
+spm_encode
+spm_export_vocab
+spm_train
+spm_normalize
+
+*.pb.cc
+*.pb.h
```
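Two pattern styles appear in the ignore list above: entries with a leading slash (`/configure`, `/config.h`) are anchored to the repository root, while bare globs (`*.o`, `*.la`) and bare names (`spm_train`) match at any depth. A minimal sketch of the difference, assuming `git` is on PATH (the throwaway repository and file names are illustrative only):

```shell
#!/bin/sh
# Demonstrate anchored vs. unanchored .gitignore patterns in a
# throwaway repository: /configure matches only at the root,
# while *.o matches at any depth.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
printf '/configure\n*.o\n' > .gitignore
mkdir sub
touch configure sub/configure main.o sub/main.o
git check-ignore configure main.o sub/main.o       # prints the three ignored paths
git check-ignore -q sub/configure || echo 'sub/configure: not ignored'
```

The build outputs of this repository (`compile_charsmap`, the `spm_*` binaries) are listed unanchored, so they are caught wherever the build drops them.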
README.md (11 changed lines)
````diff
@@ -80,6 +80,11 @@ On Ubuntu, autotools and protobuf library can be install with apt-get:
 ```
 (If `libprotobuf9v5` is not found, try `libprotobuf-c++` instead.)
 
+On OSX, you can use brew:
+```
+% brew install protobuf
+```
+
 ## Build and Install SentencePiece
 ```
 % cd /path/to/sentencepiece
````
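The hunk above only covers the protobuf dependency. On OSX the autotools themselves also typically come from brew; a plausible end-to-end bootstrap in the README's own `%` style (the extra package names are an assumption, not taken from this diff):

```
% brew install protobuf autoconf automake libtool
% cd /path/to/sentencepiece
% ./autogen.sh
% ./configure
% make
% sudo make install
```

Note that brew installs GNU libtool's tools with a `g` prefix (`glibtoolize`), which is exactly what the bootstrap-script change later in this commit accounts for.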
````diff
@@ -131,7 +136,7 @@ Use `--extra_options` flag to decode the text in reverse order.
 ## End-to-End Example
 ```
 % spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
 unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
 input: "../data/botchan.txt"
 ... <snip>
 unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
````
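The training log above is truncated at the `<snip>`; the README's example goes on to exercise the trained model (`m.model`). A sketch of that round trip in the same `%` style, using standard `spm_encode`/`spm_decode` options (the sample sentence is illustrative, not quoted from this diff):

```
% echo "Hello world." | spm_encode --model=m.model
% echo "Hello world." | spm_encode --model=m.model --output_format=id | spm_decode --model=m.model --input_format=id
```

The second pipeline encodes to vocabulary ids and decodes them back, which is what "the original input sentence is restored from the vocabulary id" in the next hunk refers to.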
```diff
@@ -167,7 +172,7 @@ You can find that the original input sentence is restored from the vocabulary id
 * **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
 * **(Moses/KyTea)+SentencePiece**: Apply SentencePiece (Unigram) to pre-tokenized sentences. We have several variants with different tokenizers, e.g., **(Moses/MeCab)+SentencePiece**, **(MeCab/Moses)+SentencePiece**.
 * **char**: Segments sentence by characters.
 
 * Data sets:
   * [KFTT](http://www.phontron.com/kftt/index.html)
```
```diff
@@ -180,7 +185,7 @@ You can find that the original input sentence is restored from the vocabulary id
 * Evaluation metrics:
   * Case-sensitive BLEU on detokenized text with NIST scorer and KyTea segmenter. Used in-house rule-based detokenizer for Moses/KyTea/MeCab/neologd.
 
 ### Results (BLEU scores)
 #### English to Japanese
 |Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
```
```diff
@@ -18,7 +18,9 @@ aclocal -I .
 echo "Running autoheader..."
 autoheader
 echo "Running libtoolize .."
-libtoolize
+case `uname` in Darwin*) glibtoolize ;;
+*) libtoolize ;;
+esac
 echo "Running automake ..."
 automake --add-missing --copy
 echo "Running autoconf ..."
```
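This case statement is the heart of the OSX fix: GNU libtoolize is installed as `glibtoolize` by Homebrew and MacPorts to avoid clashing with Apple's BSD `libtool`, so the bootstrap script must branch on `uname`. A self-contained sketch of the same dispatch, factored into a function so it can be exercised on any platform (`pick_libtoolize` is a hypothetical helper name, not part of the repository):

```shell
#!/bin/sh
# Choose the libtoolize binary name from a uname string, mirroring
# the case-on-uname dispatch added to the bootstrap script above.
# pick_libtoolize is a hypothetical helper used only for illustration.
pick_libtoolize() {
  case "$1" in
    Darwin*) echo glibtoolize ;;   # macOS: Homebrew/MacPorts g-prefixed name
    *)       echo libtoolize ;;    # everything else: stock GNU name
  esac
}
pick_libtoolize Darwin    # -> glibtoolize
pick_libtoolize Linux     # -> libtoolize
```

In the real script the result is not captured; the matching tool is simply invoked in place, so a missing `glibtoolize` fails loudly during bootstrap rather than at configure time.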