# SentencePiece
[![Build Status](https://travis-ci.org/google/sentencepiece.svg?branch=master)](https://travis-ci.org/google/sentencepiece)
[![Coverage Status](https://coveralls.io/repos/github/google/sentencepiece/badge.svg?branch=master)](https://coveralls.io/github/google/sentencepiece?branch=master)
[![GitHub Issues](https://img.shields.io/github/issues/google/sentencepiece.svg)](https://github.com/google/sentencepiece/issues)
[![PyPI version](https://badge.fury.io/py/sentencepiece.svg)](https://badge.fury.io/py/sentencepiece)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for
Neural Network-based text generation systems where the vocabulary size
is predetermined prior to the neural model training. SentencePiece implements
**subword units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and
**unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)])
with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

**This is not an official Google product.**
## Technical highlights
- **Purely data driven**: SentencePiece trains tokenization and detokenization
  models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
- **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
- **Multiple subword algorithms**: **BPE** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) which helps to improve the robustness and accuracy of NMT models.
- **Fast and lightweight**: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
- **Self-contained**: The same tokenization/detokenization is obtained as long as the same model file is used.
- **Direct vocabulary id generation**: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
- **NFKC-based normalization**: SentencePiece performs NFKC-based text normalization.
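
To give a feel for what NFKC normalization does to input text, here is a minimal sketch using only Python's standard `unicodedata` module. It illustrates plain Unicode NFKC; SentencePiece's own normalizer is NFKC-based and may differ in details, so treat the helper name `nfkc` and the sample strings as illustrative assumptions rather than the library's behavior.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply plain Unicode NFKC normalization (standard library only)."""
    return unicodedata.normalize("NFKC", text)


# Full-width letters, an ideographic space, and a ligature are folded
# into their compatibility equivalents.
samples = ["Ｈｅｌｌｏ　Ｗｏｒｌｄ", "ﬁne"]
for s in samples:
    print(repr(s), "->", repr(nfkc(s)))
```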
## Comparisons with other implementations
|Feature|SentencePiece|[subword-nmt](https://github.com/rsennrich/subword-nmt)|[WordPiece](https://arxiv.org/pdf/1609.08144.pdf)|
|:---|:---:|:---:|:---:|
|Supported algorithm|BPE, unigram, char, word|BPE|BPE*|
|OSS?|Yes|Yes|Google internal|
|Subword regularization|[Yes](#subword-regularization)|No|No|
|Python Library (pip)|[Yes](python/README.md)|No|N/A|
|C++ Library|[Yes](doc/api.md)|No|N/A|
|Pre-segmentation required?|[No](#whitespace-is-treated-as-a-basic-symbol)|Yes|Yes|
|Customizable normalization (e.g., NFKC)|[Yes](doc/normalization.md)|No|N/A|
|Direct id generation|[Yes](#end-to-end-example)|No|N/A|

Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

## Overview
### What is SentencePiece?
SentencePiece is a re-implementation of **subword units**, an effective way to alleviate the open vocabulary
problems in neural machine translation. SentencePiece supports two segmentation algorithms, **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]. Here are the high-level differences from other implementations.
#### The number of unique tokens is predetermined
Neural Machine Translation models typically operate with a fixed
vocabulary. Unlike most unsupervised word segmentation algorithms, which
assume an infinite vocabulary, SentencePiece trains the segmentation model such
that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.
Note that SentencePiece specifies the final vocabulary size for training, which is different from
[subword-nmt](https://github.com/rsennrich/subword-nmt), which uses the number of merge operations.
The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
#### Trains from raw sentences
Previous subword implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it complicates preprocessing because language-dependent tokenizers have to be run in advance.
The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese, Japanese and Korean, where no explicit spaces exist between words.
#### Whitespace is treated as a basic symbol
The first step of natural language processing is text tokenization. For
example, a standard English tokenizer would segment the text "Hello World." into the
following three tokens.

> [Hello] [World] [.]

One observation is that the original input and the tokenized sequence are **NOT
reversibly convertible**. For instance, the information that there is no space between
"World" and "." is dropped from the tokenized sequence, since e.g., `Tokenize("World.") == Tokenize("World .")`.

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

> Hello▁World.

Then, this text is segmented into small pieces, for example:

> [Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.
```
detokenized = ''.join(pieces).replace('▁', ' ')
```
This feature makes it possible to perform detokenization without relying on language-specific resources.
Note that we cannot apply the same lossless conversions when splitting the
sentence with standard word segmenters, since they treat the whitespace as a
special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

* (en) Hello World. → [Hello] [World] [.] \(A space between Hello and World\)
* (ja) こんにちは世界。 → [こんにちは] [世界] [。] \(No space between こんにちは and 世界\)
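
The lossless round trip described in this section can be sketched in a few lines of plain Python. This is only an illustration of the escaping idea: the function names are hypothetical, the piece lists are written by hand instead of coming from a trained model, and details of the real library (such as the "▁" it places before the first word in the end-to-end example) are ignored.

```python
META = "\u2581"  # "▁", the meta symbol used to escape whitespace


def escape_whitespace(text: str) -> str:
    """Replace each space with the meta symbol so it survives segmentation."""
    return text.replace(" ", META)


def detokenize(pieces) -> str:
    """Concatenate the pieces and turn the meta symbol back into a space."""
    return "".join(pieces).replace(META, " ")


# Hand-written segmentations (a trained model would produce these).
en_pieces = ["Hello", META + "Wor", "ld", "."]
ja_pieces = ["こんにちは", "世界", "。"]

# The pieces are just a split of the escaped text, so nothing is lost.
assert "".join(en_pieces) == escape_whitespace("Hello World.")

print(detokenize(en_pieces))
print(detokenize(ja_pieces))
```

Because the space is carried inside the pieces, the same `detokenize` works for English and Japanese without any language-specific rule.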
#### Subword regularization
Subword regularization [[Kudo.](https://arxiv.org/abs/1804.10959)] is a simple regularization method
that virtually augments training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.

To enable subword regularization, you need to integrate the SentencePiece library
([C++](doc/api.md#sampling-subword-regularization)/[Python](python/README.md)) into the NMT system to sample one segmentation for each parameter update, which differs from the standard offline data preparation. Here is an example using the [Python library](python/README.md). You can see that 'New York' is segmented differently on each ``SampleEncode`` call. The details of the sampling parameters are found in [sentencepiece_processor.h](src/sentencepiece_processor.h).
```
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor()
>>> s.Load('spm.model')
>>> for n in range(5):
...     s.SampleEncode('New York', -1, 0.1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
```
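
To make the role of the smoothing parameter (the `0.1` above) more concrete, here is a rough sketch of sampling one segmentation from a list of candidates with probability proportional to `P(x)^alpha`, the sampling scheme described in the subword regularization paper. It is not the library's implementation: the function name, the candidate list, and the log probabilities are all made up for illustration.

```python
import math
import random


def sample_segmentation(candidates, alpha, rng=random):
    """Draw one candidate with probability proportional to exp(alpha * log_prob).

    candidates: list of (pieces, log_prob) pairs (hypothetical values).
    alpha: smoothing parameter; 0 gives a uniform choice, large values
           concentrate the choice on the most probable candidate.
    """
    best = max(lp for _, lp in candidates)
    # Subtract the best score before exponentiating to avoid underflow.
    weights = [math.exp(alpha * (lp - best)) for _, lp in candidates]
    return rng.choices([p for p, _ in candidates], weights=weights, k=1)[0]


# Made-up candidates for 'New York' with made-up log probabilities.
candidates = [
    (["▁", "New", "▁York"], -5.0),
    (["▁", "N", "e", "w", "▁York"], -9.0),
    (["▁", "New", "▁Y", "o", "r", "k"], -11.0),
]

rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation(candidates, alpha=0.1, rng=rng))
```

With a small `alpha` the less likely segmentations are drawn fairly often, which is what exposes the NMT model to varied segmentations of the same text.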
## Installation
### Python module
SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation.
For Linux (x64/i686) environments, you can install the Python binary package of SentencePiece with:
```
% pip install sentencepiece
```
For more detail, see [Python module](python/README.md).

### C++ (from source)
The following tools and libraries are required to build SentencePiece:
* GNU autotools (autoconf automake libtool)
* C++11 compiler
* [protobuf](https://github.com/google/protobuf) library

On Ubuntu, autotools and protobuf can be installed with apt-get:
```
% sudo apt-get install autoconf automake libtool pkg-config libprotobuf9v5 protobuf-compiler libprotobuf-dev
```
The name of the protobuf library package differs between Ubuntu releases. Please use the appropriate command for your Ubuntu version.

On Ubuntu 14.04 LTS (Trusty Tahr):
```
% sudo apt-get install libprotobuf8
```
On Ubuntu 16.04 LTS (Xenial Xerus):
```
% sudo apt-get install libprotobuf9v5
```
On Ubuntu 17.10 (Artful Aardvark) and later:
```
% sudo apt-get install libprotobuf10
```
On OSX, you can use brew:
```
% brew install protobuf autoconf automake libtool
```
If you want to use a self-prepared protobuf library, set the environment variables below before building:
```
% export PROTOBUF=<path_to_protobuf>
% export PROTOC="$PROTOBUF/bin/protoc"
% export PROTOBUF_LIBS="-L$PROTOBUF/lib -lprotobuf -D_THREAD_SAFE"
% export PROTOBUF_CFLAGS="-I$PROTOBUF/include -D_THREAD_SAFE"
```
### Build and Install SentencePiece
```
% cd /path/to/sentencepiece
% ./autogen.sh
% ./configure
% make
% make check
% sudo make install
$ sudo ldconfig -v
```
On OSX/macOS, replace the last command with the following:
```$ sudo update_dyld_shared_cache```
## Usage instructions
### Train SentencePiece Model
```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --model_type=<type>
```
* `--input`: one-sentence-per-line **raw** corpus file. No need to run
tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes
the input with Unicode NFKC. You can pass a comma-separated list of files.
* `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
* `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000
* `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using `word` type.

Note that `spm_train` loads only the first `--input_sentence_size` sentences (default value is 10M).
Use the `--help` flag to display all parameters for training.
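
When training is launched from a script, it can help to assemble the flags above programmatically. The sketch below builds the `spm_train` command line from the four flags documented in this section and runs it with `subprocess`. The helper names and the corpus file names are hypothetical, and the sketch assumes `spm_train` is installed and on `PATH`.

```python
import subprocess

MODEL_TYPES = ("unigram", "bpe", "char", "word")


def build_spm_train_args(inputs, model_prefix, vocab_size=8000, model_type="unigram"):
    """Return the argv list for spm_train using the flags described above."""
    if model_type not in MODEL_TYPES:
        raise ValueError("model_type must be one of %s" % (MODEL_TYPES,))
    return [
        "spm_train",
        "--input=" + ",".join(inputs),  # comma-separated list of raw corpus files
        "--model_prefix=" + model_prefix,
        "--vocab_size=" + str(vocab_size),
        "--model_type=" + model_type,
    ]


def run_spm_train(args):
    """Run spm_train; raises CalledProcessError if training fails."""
    subprocess.run(args, check=True)


# Hypothetical corpus files; nothing is executed until run_spm_train is called.
args = build_spm_train_args(["corpus.en", "corpus.ja"], "m", vocab_size=8000)
print(" ".join(args))
```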
### Encode raw text into sentence pieces/ids
```
% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output
```
Use `--extra_options` flag to insert the BOS/EOS markers or reverse the input sequence.
```
% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```
SentencePiece supports n-best segmentation and segmentation sampling with the `--output_format=(nbest|sample)_(piece|id)` flags.
```
% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
```
### Decode sentence pieces/ids into raw text
```
% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output
```
Use `--extra_options` flag to decode the text in reverse order.
```
% spm_decode --extra_options=reverse < input > output
```
### End-to-End Example
```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab
% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6
% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```
You can find that the original input sentence is restored from the vocabulary id sequence.
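
The same encode/decode round trip can be sketched with the Python wrapper. The helper below only relies on the `EncodeAsPieces`, `EncodeAsIds`, and `DecodeIds` methods of `SentencePieceProcessor`; the function names `round_trip` and `demo` are hypothetical, and `demo` assumes that the `sentencepiece` package is installed and that `m.model` from the example above exists.

```python
def round_trip(sp, text):
    """Encode text to pieces and ids, then decode the ids back to text."""
    pieces = sp.EncodeAsPieces(text)
    ids = sp.EncodeAsIds(text)
    restored = sp.DecodeIds(ids)
    return pieces, ids, restored


def demo(model_path="m.model"):
    """Load a trained model and print the round trip for one sentence."""
    import sentencepiece as spm  # imported here so round_trip has no hard dependency

    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    pieces, ids, restored = round_trip(sp, "I saw a girl with a telescope.")
    print(pieces)
    print(ids)
    print(restored)
```

Calling `demo()` after the training step above should print the pieces, the ids, and the restored sentence; the exact pieces and ids depend on the trained model.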
### Export vocabulary list
```
% spm_export_vocab --model=<model_file> --output=<output file>
```
```<output file>``` stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
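
As a rough illustration of how such a file could be consumed, the sketch below parses vocabulary lines into an id-to-piece list and a piece-to-score mapping. It assumes each line holds a piece and its log probability separated by a tab, and that ids count lines from 0; both are assumptions about the file layout, and the sample lines and function name are invented for the example.

```python
def parse_vocab(lines):
    """Parse 'piece<TAB>log_prob' lines; the id is the 0-based line number (assumed)."""
    id_to_piece = []
    piece_to_score = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        piece, score = line.rsplit("\t", 1)
        id_to_piece.append(piece)
        piece_to_score[piece] = float(score)
    return id_to_piece, piece_to_score


# Invented sample content; a real file would come from spm_export_vocab.
sample = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "▁a\t-3.5\n", "▁the\t-3.75\n"]
id_to_piece, piece_to_score = parse_vocab(sample)
print(id_to_piece[3], piece_to_score["▁a"])
```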
### Redefine special meta tokens
By default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;/s&gt;) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.
```
% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=...
```
When an id is set to -1, e.g., ```bos_id=-1```, that special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (&lt;pad&gt;) as ```--pad_id=3```.

If you want to assign other special tokens, please see [Use custom symbols](doc/special_symbols.md).

### Vocabulary restriction
```spm_encode``` accepts a ```--vocabulary``` and a ```--vocabulary_threshold``` option so that ```spm_encode``` will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in the [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).

The usage is basically the same as that of ```subword-nmt```. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get the resulting vocabulary for each:
```
% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
```
The ```shuffle``` command is used just in case, because ```spm_train``` loads only the first 10M lines of the corpus by default.

Then segment the train/test corpus with the ```--vocabulary``` option:
```
% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
```
## Advanced topics
* [SentencePiece Experiments](doc/experiments.md)
* [SentencePieceProcessor C++ API](doc/api.md)
* [Use custom text normalization rules](doc/normalization.md)
* [Use custom symbols](doc/special_symbols.md)
* [Segmentation and training algorithms in detail]