mirror of https://github.com/google/sentencepiece.git
synced 2024-10-26 11:38:45 +03:00

added initial support to compile on OSX

This commit is contained in:
parent 777e7133a1
commit f55d4a88db
.gitignore (vendored, new file, 49 lines)

@@ -0,0 +1,49 @@
Makefile
Makefile.in
/ar-lib
/mdate-sh
/py-compile
/test-driver
/ylwrap

/autom4te.cache
/autoscan.log
/autoscan-*.log
/aclocal.m4
/compile
/config.guess
/config.h.in
/config.sub
/configure
/configure.scan
/depcomp
/install-sh
/missing
/stamp-h1
/libtool
/config.h
/config.status
/autogen.sh
/ltmain.sh

*.o
*.lo
*.a
*.la

.libs
.deps

*.m4
*.log

compile_charsmap

spm_decode
spm_encode
spm_export_vocab
spm_train
spm_normalize

*.pb.cc
*.pb.h
|
README.md (11 changed lines)

@@ -80,6 +80,11 @@ On Ubuntu, autotools and protobuf library can be install with apt-get:
```
(If `libprotobuf9v5` is not found, try `libprotobuf-c++` instead.)

+On OSX, you can use brew:
+```
+% brew install protobuf
+```
+
## Build and Install SentencePiece
```
% cd /path/to/sentencepiece
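The install steps above assume the protobuf toolchain ends up on PATH, whether it comes from apt-get or brew. As an illustrative pre-build sanity check (not part of the README), you can verify the protobuf compiler is visible before running the autotools build:

```shell
#!/bin/sh
# Illustrative check, not from the README: confirm the protobuf compiler
# installed via apt-get (Ubuntu) or brew (OSX) is actually on PATH.
if command -v protoc >/dev/null 2>&1; then
  status="protoc found: $(protoc --version)"
else
  status="protoc missing: install protobuf first"
fi
echo "$status"
```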
@@ -131,7 +136,7 @@ Use `--extra_options` flag to decode the text in reverse order.
## End-to-End Example
```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
@@ -167,7 +172,7 @@ You can find that the original input sentence is restored from the vocabulary id
* **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
* **(Moses/KyTea)+SentencePiece**: Apply SentencePiece (Unigram) to pre-tokenized sentences. We have several variants with different tokenizers., e.g., **(Moses/MeCab)+SentencePiece**, **(MeCab/Moses)+SentencePiece**.
* *char**: Segments sentence by characters.

* Data sets:
  * [KFTT](http://www.phontron.com/kftt/index.html)

@@ -180,7 +185,7 @@ You can find that the original input sentence is restored from the vocabulary id
* Evaluation metrics:
  * Case-sensitive BLEU on detokenized text with NIST scorer and KyTea segmenter. Used in-house rule-based detokenizer for Moses/KyTea/MeCab/neologd.


### Results (BLEU scores)
#### English to Japanese
|Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
autogen.sh

@@ -18,7 +18,9 @@ aclocal -I .
echo "Running autoheader..."
autoheader
echo "Running libtoolize .."
-libtoolize
+case `uname` in Darwin*) glibtoolize ;;
+             *) libtoolize ;;
+esac
echo "Running automake ..."
automake --add-missing --copy
echo "Running autoconf ..."
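The hunk above dispatches on `uname` because Homebrew installs GNU libtool with a `g` prefix on macOS, so the bootstrap tool is named `glibtoolize` there rather than `libtoolize`. A minimal standalone sketch of the same platform check (illustrative, not the exact commit script):

```shell
#!/bin/sh
# Pick the libtoolize binary by platform: on macOS (uname prints "Darwin"),
# Homebrew's GNU libtool is installed as "glibtoolize"; elsewhere the plain
# "libtoolize" name applies. Mirrors the case statement in the hunk above.
case "$(uname)" in
  Darwin*) LIBTOOLIZE=glibtoolize ;;
  *)       LIBTOOLIZE=libtoolize ;;
esac
echo "selected: $LIBTOOLIZE"
```

The same pattern generalizes to other prefixed GNU tools on macOS (e.g. GNU sed installed as `gsed`).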