mirror of https://github.com/google/sentencepiece.git (synced 2024-08-16 06:10:47 +03:00)

added initial support to compile on OSX

This commit is contained in: parent 777e7133a1, commit f55d4a88db
.gitignore (vendored, new file, 49 lines)
```diff
@@ -0,0 +1,49 @@
+Makefile
+Makefile.in
+/ar-lib
+/mdate-sh
+/py-compile
+/test-driver
+/ylwrap
+
+/autom4te.cache
+/autoscan.log
+/autoscan-*.log
+/aclocal.m4
+/compile
+/config.guess
+/config.h.in
+/config.sub
+/configure
+/configure.scan
+/depcomp
+/install-sh
+/missing
+/stamp-h1
+/libtool
+/config.h
+/config.status
+/autogen.sh
+/ltmain.sh
+
+*.o
+*.lo
+*.a
+*.la
+
+.libs
+.deps
+
+*.m4
+*.log
+
+compile_charsmap
+
+spm_decode
+spm_encode
+spm_export_vocab
+spm_train
+spm_normalize
+
+*.pb.cc
+*.pb.h
```
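Two pattern styles appear in the ignore list above: entries with a leading slash (`/configure`, `/config.h`) are anchored to the repository root, while bare globs (`*.o`, `*.la`) and bare names (`spm_train`) match at any depth. A minimal sketch of the difference, assuming `git` is on PATH (the throwaway repository and file names are illustrative only):

```shell
#!/bin/sh
# Demonstrate anchored vs. unanchored .gitignore patterns in a
# throwaway repository: /configure matches only at the root,
# while *.o matches at any depth.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
printf '/configure\n*.o\n' > .gitignore
mkdir sub
touch configure sub/configure main.o sub/main.o
git check-ignore configure main.o sub/main.o       # prints the three ignored paths
git check-ignore -q sub/configure || echo 'sub/configure: not ignored'
```

The build outputs of this repository (`compile_charsmap`, the `spm_*` binaries) are listed unanchored, so they are caught wherever the build drops them.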
README.md (11 changed lines)
````diff
@@ -80,6 +80,11 @@ On Ubuntu, autotools and protobuf library can be install with apt-get:
 ```
 (If `libprotobuf9v5` is not found, try `libprotobuf-c++` instead.)
 
+On OSX, you can use brew:
+```
+% brew install protobuf
+```
+
 ## Build and Install SentencePiece
 ```
 % cd /path/to/sentencepiece
````
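The hunk above only covers the protobuf dependency. On OSX the autotools themselves also typically come from brew; a plausible end-to-end bootstrap in the README's own `%` style (the extra package names are an assumption, not taken from this diff):

```
% brew install protobuf autoconf automake libtool
% cd /path/to/sentencepiece
% ./autogen.sh
% ./configure
% make
% sudo make install
```

Note that brew installs GNU libtool's tools with a `g` prefix (`glibtoolize`), which is exactly what the bootstrap-script change later in this commit accounts for.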
````diff
@@ -131,7 +136,7 @@ Use `--extra_options` flag to decode the text in reverse order.
 ## End-to-End Example
 ```
 % spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
 unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
 input: "../data/botchan.txt"
 ... <snip>
 unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
````
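The training log above is truncated at the `<snip>`; the README's example goes on to exercise the trained model (`m.model`). A sketch of that round trip in the same `%` style, using standard `spm_encode`/`spm_decode` options (the sample sentence is illustrative, not quoted from this diff):

```
% echo "Hello world." | spm_encode --model=m.model
% echo "Hello world." | spm_encode --model=m.model --output_format=id | spm_decode --model=m.model --input_format=id
```

The second pipeline encodes to vocabulary ids and decodes them back, which is what "the original input sentence is restored from the vocabulary id" in the next hunk refers to.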
```diff
@@ -167,7 +172,7 @@ You can find that the original input sentence is restored from the vocabulary id
 * **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
 * **(Moses/KyTea)+SentencePiece**: Apply SentencePiece (Unigram) to pre-tokenized sentences. We have several variants with different tokenizers, e.g., **(Moses/MeCab)+SentencePiece**, **(MeCab/Moses)+SentencePiece**.
 * **char**: Segments sentence by characters.
 
 * Data sets:
   * [KFTT](http://www.phontron.com/kftt/index.html)
```
```diff
@@ -180,7 +185,7 @@ You can find that the original input sentence is restored from the vocabulary id
 * Evaluation metrics:
   * Case-sensitive BLEU on detokenized text with NIST scorer and KyTea segmenter. Used in-house rule-based detokenizer for Moses/KyTea/MeCab/neologd.
 
 ### Results (BLEU scores)
 #### English to Japanese
 |Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
```
```diff
@@ -18,7 +18,9 @@ aclocal -I .
 echo "Running autoheader..."
 autoheader
 echo "Running libtoolize .."
-libtoolize
+case `uname` in Darwin*) glibtoolize ;;
+*) libtoolize ;;
+esac
 echo "Running automake ..."
 automake --add-missing --copy
 echo "Running autoconf ..."
```
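This case statement is the heart of the OSX fix: GNU libtoolize is installed as `glibtoolize` by Homebrew and MacPorts to avoid clashing with Apple's BSD `libtool`, so the bootstrap script must branch on `uname`. A self-contained sketch of the same dispatch, factored into a function so it can be exercised on any platform (`pick_libtoolize` is a hypothetical helper name, not part of the repository):

```shell
#!/bin/sh
# Choose the libtoolize binary name from a uname string, mirroring
# the case-on-uname dispatch added to the bootstrap script above.
# pick_libtoolize is a hypothetical helper used only for illustration.
pick_libtoolize() {
  case "$1" in
    Darwin*) echo glibtoolize ;;   # macOS: Homebrew/MacPorts g-prefixed name
    *)       echo libtoolize ;;    # everything else: stock GNU name
  esac
}
pick_libtoolize Darwin    # -> glibtoolize
pick_libtoolize Linux     # -> libtoolize
```

In the real script the result is not captured; the matching tool is simply invoked in place, so a missing `glibtoolize` fails loudly during bootstrap rather than at configure time.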