# SentencePiece

[![Build C++](https://github.com/google/sentencepiece/actions/workflows/cmake.yml/badge.svg)](https://github.com/google/sentencepiece/actions/workflows/cmake.yml)
[![Build Wheels](https://github.com/google/sentencepiece/actions/workflows/wheel.yml/badge.svg)](https://github.com/google/sentencepiece/actions/workflows/wheel.yml)
[![GitHub Issues](https://img.shields.io/github/issues/google/sentencepiece.svg)](https://github.com/google/sentencepiece/issues)
[![PyPI version](https://badge.fury.io/py/sentencepiece.svg)](https://badge.fury.io/py/sentencepiece)
[![PyPi downloads](https://img.shields.io/pypi/dm/sentencepiece?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/sentencepiece/)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
[![SLSA 3](https://slsa.dev/images/gh-badge-level3.svg)](https://slsa.dev)

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for
Neural Network-based text generation systems where the vocabulary size
is predetermined prior to the neural model training. SentencePiece implements
**subword units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and
**unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)])
with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

**This is not an official Google product.**

## Technical highlights
- **Purely data driven**: SentencePiece trains tokenization and detokenization
  models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
- **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
- **Multiple subword algorithms**: **BPE**  [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) and [BPE-dropout](https://arxiv.org/abs/1910.13267) which help to improve the robustness and accuracy of NMT models.
- **Fast and lightweight**: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
- **Self-contained**: The same tokenization/detokenization is obtained as long as the same model file is used.
- **Direct vocabulary id generation**: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
- **NFKC-based normalization**: SentencePiece performs NFKC-based text normalization.

If you are unfamiliar with SentencePiece as software or as an algorithm, you can read [a gentle introduction here](https://medium.com/@jacky2wong/understanding-sentencepiece-under-standing-sentence-piece-ac8da59f6b08).


## Comparisons with other implementations
|Feature|SentencePiece|[subword-nmt](https://github.com/rsennrich/subword-nmt)|[WordPiece](https://arxiv.org/pdf/1609.08144.pdf)|
|:---|:---:|:---:|:---:|
|Supported algorithm|BPE, unigram, char, word|BPE|BPE*|
|OSS?|Yes|Yes|Google internal|
|Subword regularization|[Yes](#subword-regularization-and-bpe-dropout)|No|No|
|Python Library (pip)|[Yes](python/README.md)|No|N/A|
|C++ Library|[Yes](doc/api.md)|No|N/A|
|Pre-segmentation required?|[No](#whitespace-is-treated-as-a-basic-symbol)|Yes|Yes|
|Customizable normalization (e.g., NFKC)|[Yes](doc/normalization.md)|No|N/A|
|Direct id generation|[Yes](#end-to-end-example)|No|N/A|

\* Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

## Overview
### What is SentencePiece?
SentencePiece is a re-implementation of **subword units**, an effective way to alleviate the open vocabulary
problem in neural machine translation. SentencePiece supports two segmentation algorithms, **byte-pair-encoding (BPE)** [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]. Here are the high-level differences from other implementations.

#### The number of unique tokens is predetermined
Neural Machine Translation models typically operate with a fixed
vocabulary. Unlike most unsupervised word segmentation algorithms, which
assume an infinite vocabulary, SentencePiece trains the segmentation model such
that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from
[subword-nmt](https://github.com/rsennrich/subword-nmt) that uses the number of merge operations.
The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
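
As a rough illustration of the difference, the toy trainer below keeps merging the most frequent pair until the vocabulary reaches a requested size, so the number of merges is a by-product rather than an input. This is a minimal sketch and not SentencePiece's implementation; `train_toy_bpe` and `merge_pair` are hypothetical names introduced here.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) pair in seq with the merged piece."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_bpe(lines, vocab_size):
    """Toy BPE: merge the most frequent pair until the vocabulary has vocab_size pieces."""
    # Start from single characters, with spaces escaped as the meta symbol U+2581.
    seqs = [list(line.replace(" ", "\u2581")) for line in lines]
    vocab = sorted({ch for seq in seqs for ch in seq})
    seen = set(vocab)
    num_merges = 0
    while len(vocab) < vocab_size:
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # nothing left to merge: the corpus is too small for vocab_size
        # Most frequent pair; ties are broken by the pair itself for determinism.
        (left, right), _ = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        seqs = [merge_pair(seq, left, right) for seq in seqs]
        num_merges += 1
        if left + right not in seen:
            seen.add(left + right)
            vocab.append(left + right)
    return vocab, num_merges


vocab, num_merges = train_toy_bpe(["low lower lowest", "new newer newest"], vocab_size=20)
print(len(vocab), num_merges)
```

The caller only states the target size; how many merges that takes depends on how many distinct characters the corpus contains.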

#### Trains from raw sentences
Previous subword implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it complicates preprocessing because language-dependent tokenizers have to be run in advance.
The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.

#### Whitespace is treated as a basic symbol
The first step of natural language processing is text tokenization. For
example, a standard English tokenizer would segment the text "Hello World." into the
following three tokens.

> [Hello] [World] [.]

One observation is that the original input and the tokenized sequence are **NOT
reversibly convertible**. For instance, the information that there is no space between
"World" and "." is dropped from the tokenized sequence, since, e.g., `Tokenize("World.") == Tokenize("World .")`.
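
The loss of information can be reproduced with a few lines of Python. `naive_tokenize` is a hypothetical stand-in for "a standard English tokenizer", not a function from any library; it splits on word characters and punctuation and discards whitespace.

```python
import re


def naive_tokenize(text):
    """Split into runs of word characters and single punctuation marks, dropping spaces."""
    return re.findall(r"\w+|[^\w\s]", text)


print(naive_tokenize("World."))   # ['World', '.']
print(naive_tokenize("World ."))  # ['World', '.']
```

Both inputs give the same token list, so the original spacing cannot be recovered from the tokens alone.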

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

> Hello▁World.

Then, this text is segmented into small pieces, for example:

> [Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

```
  detokenized = ''.join(pieces).replace('▁', ' ')
```

This feature makes it possible to perform detokenization without relying on language-specific resources.
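
The round trip can be sketched end to end as follows. The helper names (`escape`, `split_at`, `detokenize`) are assumptions made for this example, and `split_at` only mimics a trained model by cutting at fixed offsets. The sketch also assumes the input text does not itself contain "▁".

```python
META = "\u2581"  # the meta symbol "▁" used to escape whitespace


def escape(text):
    """Escape spaces with the meta symbol so they become ordinary characters."""
    return text.replace(" ", META)


def split_at(text, cuts):
    """Split text at the given character offsets (a stand-in for a trained model)."""
    bounds = [0] + sorted(cuts) + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def detokenize(pieces):
    """Concatenate the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(META, " ")


pieces = split_at(escape("Hello World."), cuts=[5, 9, 11])
print(pieces)              # ['Hello', '▁Wor', 'ld', '.']
print(detokenize(pieces))  # Hello World.
```

Wherever the cuts fall, joining the pieces restores the escaped string, so the detokenizer needs no knowledge of the language.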

Note that we cannot apply the same lossless conversions when splitting the
sentence with standard word segmenters, since they treat the whitespace as a
special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

* (en) Hello World.   → [Hello] [World] [.]   \(A space between Hello and World\)
* (ja) こんにちは世界。  → [こんにちは] [世界] [。] \(No space between こんにちは and 世界\)

#### Subword regularization and BPE-dropout
Subword regularization [[Kudo.](https://arxiv.org/abs/1804.10959)] and BPE-dropout [[Provilkov et al.](https://arxiv.org/abs/1910.13267)] are simple regularization methods
that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as the robustness of NMT models.

To enable subword regularization, integrate the SentencePiece library
([C++](doc/api.md#sampling-subword-regularization)/[Python](python/README.md)) into the NMT system so that one segmentation is sampled for each parameter update, which differs from standard offline data preparation. The example below uses the [Python library](python/README.md). Note that 'New York' is segmented differently on each call to ``SampleEncode`` (C++) or to ``encode`` with ``enable_sampling=True`` (Python). The sampling parameters are described in [sentencepiece_processor.h](src/sentencepiece_processor.h).

```
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
```
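
To give an intuition for the `alpha` parameter, the sketch below draws one segmentation from a finite candidate list with probability proportional to `P(x)^alpha`, the smoothing used in the subword regularization paper. It is not the library's sampler; the function names, the candidate list, and the log-probabilities are made up for illustration.

```python
import math
import random


def sampling_weights(log_probs, alpha):
    """Return P(x_i)^alpha / sum_j P(x_j)^alpha for each candidate segmentation."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the maximum for numerical stability
    unnorm = [math.exp(s - top) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sample_segmentation(candidates, log_probs, alpha, rng=random):
    """Draw one segmentation according to the smoothed distribution."""
    weights = sampling_weights(log_probs, alpha)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical candidates and log-probabilities, for illustration only.
candidates = [
    ["▁", "New", "▁York"],
    ["▁", "New", "▁Y", "o", "r", "k"],
    ["▁", "N", "e", "w", "▁York"],
]
log_probs = [-5.0, -12.0, -14.0]

rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation(candidates, log_probs, alpha=0.1, rng=rng))
```

A small `alpha` flattens the distribution toward uniform sampling, while a large `alpha` concentrates it on the most probable segmentation.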

## Installation

### Python module
SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation.
You can install the Python binary package of SentencePiece with:

```
pip install sentencepiece
```

For more details, see [Python module](python/README.md).

### Build and install SentencePiece command line tools from C++ source
The following tools and libraries are required to build SentencePiece:

* [cmake](https://cmake.org/)
* C++11 compiler
* [gperftools](https://github.com/gperftools/gperftools) library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:
```
% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```

Then, you can build and install command line tools as follows.
```
% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v
```
On OSX/macOS, replace the last command with `sudo update_dyld_shared_cache`.

### Build and install using vcpkg

You can download and install sentencepiece using the [vcpkg](https://github.com/Microsoft/vcpkg) dependency manager:

    git clone https://github.com/Microsoft/vcpkg.git
    cd vcpkg
    ./bootstrap-vcpkg.sh
    ./vcpkg integrate install
    ./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.

### Download and install SentencePiece from signed released wheels

You can download the wheel from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest).
We generate [SLSA3 signatures](https://slsa.dev) using the OpenSSF's [slsa-framework/slsa-github-generator](https://github.com/slsa-framework/slsa-github-generator) during the release process. To verify a release binary:
1. Install the verification tool from [slsa-framework/slsa-verifier#installation](https://github.com/slsa-framework/slsa-verifier#installation).
2. Download the provenance file `attestation.intoto.jsonl` from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest).
3. Run the verifier:
```shell
slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
```

After verification, install the wheel with `pip install wheel_file.whl`.

## Usage instructions
### Train SentencePiece Model
```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```
* `--input`: one-sentence-per-line **raw** corpus file. No need to run
  tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes
  the input with Unicode NFKC. You can pass a comma-separated list of files.
* `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
* `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000
* `--character_coverage`: fraction of characters covered by the model. Good defaults are `0.9995` for languages with a rich character set, such as Japanese or Chinese, and `1.0` for other languages with a small character set.
* `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using the `word` type.

Use `--help` flag to display all parameters for training, or see [here](doc/options.md) for an overview.
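
One way to read `--character_coverage` is as the share of character occurrences in the corpus that the kept character set must account for. The sketch below follows that reading: it keeps the most frequent characters until the requested share is reached. This is an assumption made for illustration, not the trainer's actual code, and `covered_characters` is a hypothetical helper.

```python
from collections import Counter


def covered_characters(lines, character_coverage):
    """Pick the most frequent characters until they cover the requested fraction."""
    counts = Counter(ch for line in lines for ch in line)
    total = sum(counts.values())
    kept, covered = [], 0
    for ch, n in counts.most_common():
        if covered >= character_coverage * total:
            break
        kept.append(ch)
        covered += n
    return kept, covered / total


lines = ["aaaaaaaab", "aaaaaaaac"]  # 'a' is frequent, 'b' and 'c' are rare
print(covered_characters(lines, 0.85))
print(covered_characters(lines, 1.0))
```

With a coverage below 1.0 the rare characters are left out, which keeps very infrequent symbols from taking vocabulary slots in languages with large character sets.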

### Encode raw text into sentence pieces/ids
```
% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output
```

Use the `--extra_options` flag to insert the BOS/EOS markers or reverse the input sequence.
```
% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```
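
The effect of these options on an id sequence can be pictured with the sketch below. It assumes the options are applied left to right and uses the default ids given in this README (BOS = 1, EOS = 2). `apply_extra_options` is a hypothetical helper, not part of the SentencePiece API.

```python
def apply_extra_options(ids, extra_options, bos_id=1, eos_id=2):
    """Apply colon-separated options (reverse, bos, eos) to an id sequence in order."""
    out = list(ids)
    for option in extra_options.split(":"):
        if option == "reverse":
            out.reverse()
        elif option == "bos":
            out.insert(0, bos_id)
        elif option == "eos":
            out.append(eos_id)
        elif option:
            raise ValueError(f"unknown option: {option}")
    return out


print(apply_extra_options([9, 459, 11], "eos"))              # [9, 459, 11, 2]
print(apply_extra_options([9, 459, 11], "bos:eos"))          # [1, 9, 459, 11, 2]
print(apply_extra_options([9, 459, 11], "reverse:bos:eos"))  # [1, 11, 459, 9, 2]
```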

SentencePiece supports nbest segmentation and segmentation sampling with `--output_format=(nbest|sample)_(piece|id)` flags.
```
% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
```

### Decode sentence pieces/ids into raw text
```
% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output
```
Use the `--extra_options` flag to decode the text in reverse order.
```
% spm_decode --extra_options=reverse < input > output
```

### End-to-End Example
```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```
You can see that the original input sentence is restored from the vocabulary id sequence.

### Export vocabulary list
```
% spm_export_vocab --model=<model_file> --output=<output file>
```
```<output file>``` stores the list of vocabulary pieces and their emission log probabilities. The vocabulary id corresponds to the line number in this file.
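
A minimal sketch of reading such a file back into a piece-to-id mapping is shown below. It assumes one `piece<TAB>score` pair per line and 0-based ids taken from the line position; both are assumptions to check against your own output file. `load_vocab` and the sample content are hypothetical.

```python
import io


def load_vocab(lines):
    """Map each piece to (id, score), taking the id from the 0-based line position."""
    vocab = {}
    for idx, line in enumerate(lines):
        piece, _, score = line.rstrip("\n").partition("\t")
        vocab[piece] = (idx, float(score))
    return vocab


# Hypothetical file content, for illustration only.
sample = io.StringIO("<unk>\t0\n<s>\t0\n</s>\t0\n\u2581a\t-2.5\n")
vocab = load_vocab(sample)
print(vocab["\u2581a"])  # (3, -2.5)
```

For a real file, pass an open file object instead, e.g. `load_vocab(open(path, encoding="utf-8"))`.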

### Redefine special meta tokens
By default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;/s&gt;) tokens, which have the ids 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

```
% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
```
Setting an id to -1, e.g., ```bos_id=-1```, disables that special token. Note that the unknown id cannot be disabled. An id for padding (&lt;pad&gt;) can be defined with ```--pad_id=3```.

If you want to assign other special tokens, please see [Use custom symbols](doc/special_symbols.md).

### Vocabulary restriction
```spm_encode``` accepts a ```--vocabulary``` and a ```--vocabulary_threshold``` option so that ```spm_encode``` will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).

The usage is basically the same as that of ```subword-nmt```. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model and get the resulting vocabulary for each:

```
% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
```

The ```shuffle``` command stands for any line-shuffling tool (GNU coreutils provides `shuf`, for instance). It is used just in case, because ```spm_train``` loads only the first 10M lines of the corpus by default.


Then segment the train/test corpus with the ```--vocabulary``` option:
```
% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
```
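
The idea behind the threshold can be pictured with the sketch below. It assumes the generated vocabulary file holds one `piece<TAB>frequency` pair per line, and it replaces any piece below the threshold with its single characters. The real encoder's resegmentation strategy is not described in this README, so the character fallback is only a simplification; both helper names and the sample counts are hypothetical.

```python
def load_allowed_pieces(lines, threshold):
    """Collect pieces whose frequency reaches the threshold from 'piece<TAB>freq' lines."""
    allowed = set()
    for line in lines:
        piece, _, freq = line.rstrip("\n").partition("\t")
        if piece and int(freq) >= threshold:
            allowed.add(piece)
    return allowed


def restrict(pieces, allowed):
    """Keep allowed pieces; split the others into single characters (a simplification)."""
    out = []
    for piece in pieces:
        out.extend([piece] if piece in allowed else list(piece))
    return out


# Hypothetical vocabulary counts, for illustration only.
vocab_lines = ["\u2581tele\t120", "scope\t3", "\u2581a\t900"]
allowed = load_allowed_pieces(vocab_lines, threshold=50)
print(restrict(["\u2581a", "\u2581tele", "scope"], allowed))
# ['▁a', '▁tele', 's', 'c', 'o', 'p', 'e']
```

Pieces that are rare in one language of a shared model are thus never emitted for that language, so they never reach the NMT model as near-unseen symbols.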

## Advanced topics

* [SentencePiece Experiments](doc/experiments.md)
* [SentencePieceProcessor C++ API](doc/api.md)
* [Use custom text normalization rules](doc/normalization.md)
* [Use custom symbols](doc/special_symbols.md)
* [Python Module](python/README.md)
* [Segmentation and training algorithms in detail]
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								# SentencePiece
-												Update README.md
											
										
										
											2022-06-05 16:37:50 +03:00
+								[![Build C++](https://github.com/google/sentencepiece/actions/workflows/cmake.yml/badge.svg)](https://github.com/google/sentencepiece/actions/workflows/cmake.yml)
 								[![Build Wheels](https://github.com/google/sentencepiece/actions/workflows/wheel.yml/badge.svg)](https://github.com/google/sentencepiece/actions/workflows/wheel.yml)
-												Update README.md
											
										
										
											2018-06-17 11:02:01 +03:00
+								[![GitHub Issues](https://img.shields.io/github/issues/google/sentencepiece.svg)](https://github.com/google/sentencepiece/issues)
-												Update README.md
											
										
										
											2018-06-16 17:23:25 +03:00
+								[![PyPI version](https://badge.fury.io/py/sentencepiece.svg)](https://badge.fury.io/py/sentencepiece)
-												Update README.md
											
										
										
											2020-10-23 19:08:01 +03:00
+								[![PyPi downloads](https://img.shields.io/pypi/dm/sentencepiece?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/sentencepiece/)
-												Update README.md
											
										
										
											2018-06-16 17:23:25 +03:00
+								[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
 								[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
-												update

											
										
										
											2022-08-15 18:41:14 +03:00
+								[![SLSA 3](https://slsa.dev/images/gh-badge-level3.svg)](https://slsa.dev)
-												Added travis status

											
										
										
											2017-03-08 11:18:54 +03:00
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								SentencePiece is an unsupervised text tokenizer and detokenizer mainly for
 								Neural Network-based text generation systems where the vocabulary size
-												Fixed typo in README.md. Fixed the description for protobuf. Fixed the bug in bpe_train

											
										
										
											2017-03-08 08:58:16 +03:00
+								is predetermined prior to the neural model training. SentencePiece implements
-												Update README.md

Use https
											
										
										
											2022-03-31 17:59:18 +03:00
+								**subword units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)]) and
-												Update README.md
											
										
										
											2018-05-01 04:40:55 +03:00
+								**unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)])
-												Update README.md
											
										
										
											2018-06-28 20:44:57 +03:00
+								with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
 								**This is not an official Google product.**
 								## Technical highlights
 								- **Purely data driven**: SentencePiece trains tokenization and detokenization
-												Typo fix in README
											
										
										
											2018-07-16 20:28:42 +03:00
+								  models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								- **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
-												Update README.md

Use https
											
										
										
											2022-03-31 17:59:18 +03:00
+								- **Multiple subword algorithms**: **BPE**  [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
-												Update README.md
											
										
										
											2020-05-21 05:37:42 +03:00
+								- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) and [BPE-dropout](https://arxiv.org/abs/1910.13267) which help to improve the robustness and accuracy of NMT models.
-												Fixed typo in README.md. Fixed the description for protobuf. Fixed the bug in bpe_train

											
										
										
											2017-03-08 08:58:16 +03:00
+								- **Fast and lightweight**: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								- **Self-contained**: The same tokenization/detokenization is obtained as long as the same model file is used.
 								- **Direct vocabulary id generation**: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
 								- **NFKC-based normalization**: SentencePiece performs NFKC-based text normalization.
-												Update README.md

Resolves #603
											
										
										
											2021-01-04 19:47:51 +03:00
+								For those unfamiliar with SentencePiece as a software/algorithm, one can read [a gentle introduction here](https://medium.com/@jacky2wong/understanding-sentencepiece-under-standing-sentence-piece-ac8da59f6b08).
-												Update README.md
											
										
										
											2018-04-30 03:52:19 +03:00
+								## Comparisons with other implementations
 								|Feature|SentencePiece|[subword-nmt](https://github.com/rsennrich/subword-nmt)|[WordPiece](https://arxiv.org/pdf/1609.08144.pdf)|
 								|:---|:---:|:---:|:---:|
 								|Supported algorithm|BPE, unigram, char, word|BPE|BPE*|
 								|OSS?|Yes|Yes|Google internal|
-												Fix dead links

											
										
										
											2022-08-09 15:15:51 +03:00
+								|Subword regularization|[Yes](#subword-regularization-and-bpe-dropout)|No|No|
-												Update README.md
											
										
										
											2018-05-01 13:23:03 +03:00
+								|Python Library (pip)|[Yes](python/README.md)|No|N/A|
 								|C++ Library|[Yes](doc/api.md)|No|N/A|
 								|Pre-segmentation required?|[No](#whitespace-is-treated-as-a-basic-symbol)|Yes|Yes|
 								|Customizable normalization (e.g., NFKC)|[Yes](doc/normalization.md)|No|N/A|
 								|Direct id generation|[Yes](#end-to-end-example)|No|N/A|
-												Update README.md
											
										
										
											2018-04-30 03:52:19 +03:00
 								Note that BPE algorithm used in WordPiece is slightly different from the original BPE.
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								## Overview
 								### What is SentencePiece?
-												typo
											
										
										
											2018-07-02 21:44:32 +03:00
+								SentencePiece is a re-implementation of **sub-word units**, an effective way to alleviate the open vocabulary
-												Update README.md
											
										
										
											2018-05-02 03:42:13 +03:00
+								  problems in neural machine translation. SentencePiece supports two segmentation algorithms, **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]. Here are the high level differences from other implementations.
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
 								#### The number of unique tokens is predetermined
 								Neural Machine Translation models typically operate with a fixed
 								vocabulary. Unlike most unsupervised word segmentation algorithms, which
 								assume an infinite vocabulary, SentencePiece trains the segmentation model such
 								that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.
-												minor spelling tweaks

											
										
										
											2018-12-08 18:14:20 +03:00
+								Note that SentencePiece specifies the final vocabulary size for training, which is different from
-												Update README.md
											
										
										
											2018-05-01 15:11:54 +03:00
+								[subword-nmt](https://github.com/rsennrich/subword-nmt) that uses the number of merge operations.
 								The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
-												Update README.md
											
										
										
											2018-05-01 15:42:24 +03:00
+								#### Trains from raw sentences
 								Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance.
-												Removes reference to that Korean has no spaces

Removes reference to that Korean has no spaces, due to that it has. [Reference](http://www.koreanwikiproject.com/wiki/Word_spacing)
											
										
										
											2019-09-11 17:21:55 +03:00
+								The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.
-												Update README.md
											
										
										
											2018-05-01 15:42:24 +03:00
-												A minor README tweak.
											
										
										
											2017-03-21 00:45:53 +03:00
+								#### Whitespace is treated as a basic symbol
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								The first step of Natural Language processing is text tokenization. For
-												Fixed typo in README.md. Fixed the description for protobuf. Fixed the bug in bpe_train

											
										
										
											2017-03-08 08:58:16 +03:00
+								example, a standard English tokenizer would segment the text "Hello world." into the
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								following three tokens.
 								> [Hello] [World] [.]
 								One observation is that the original input and tokenized sequence are **NOT
-												Fixed typo in README.md. Fixed the description for protobuf. Fixed the bug in bpe_train

											
										
										
											2017-03-08 08:58:16 +03:00
+								reversibly convertible**. For instance, the information that is no space between
 								“World” and “.” is dropped from the tokenized sequence, since e.g., `Tokenize(“World.”) == Tokenize(“World .”)`
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
 								SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.
 								> Hello▁World.
-												Fixed typo in README.md. Fixed the description for protobuf. Fixed the bug in bpe_train

											
										
										
											2017-03-08 08:58:16 +03:00
+								Then, this text is segmented into small pieces, for example:
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
 								> [Hello] [▁Wor] [ld] [.]
 								Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.
 								```
-												Fix SIL symbol in code snippet: _ -> _

											
										
										
											2020-10-15 14:51:35 +03:00
+								  detokenized = ''.join(pieces).replace('▁', ' ')
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
+								```
 								This feature makes it possible to perform detokenization without relying on language-specific resources.
 								Note that we cannot apply the same lossless conversions when splitting the
 								sentence with standard word segmenters, since they treat the whitespace as a
-												Fixed typo in README.md. Fixed the description for protobuf. Fixed the bug in bpe_train

											
										
										
											2017-03-08 08:58:16 +03:00
+								special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.
-												Initialize repository

											
										
										
											2017-03-07 13:43:50 +03:00
 								* (en) Hello world.   → [Hello] [World] [.]   \(A space between Hello and World\)
 								* (ja) こんにちは世界。  → [こんにちは] [世界] [。] \(No space between こんにちは and 世界\)
-												Update README.md
											
										
										
											2020-05-21 05:36:33 +03:00
+								#### Subword regularization and BPE-dropout
-												fix typo
											
										
										
											2021-04-23 02:46:42 +03:00
+								Subword regularization [[Kudo.](https://arxiv.org/abs/1804.10959)] and BPE-dropout [Provilkov et al](https://arxiv.org/abs/1910.13267) are simple regularization methods
-												Update README.md
											
										
										
											2020-05-21 05:36:33 +03:00
+								that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.
-												Update README.md
											
										
										
											2018-05-01 13:01:23 +03:00
-												Update README.md

											
										
										
											2020-10-01 13:50:56 +03:00
+								To enable subword regularization, you would like to integrate SentencePiece library
-												Update README.md
											
										
										
											2020-05-21 05:36:33 +03:00
+								([C++](doc/api.md#sampling-subword-regularization)/[Python](python/README.md)) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparations. Here's the example of [Python library](python/README.md). You can find that 'New York' is segmented differently on each ``SampleEncode (C++)`` or ``encode with enable_sampling=True (Python)`` calls. The details of sampling parameters are found in [sentencepiece_processor.h](src/sentencepiece_processor.h).
-												Update README.md
											
										
										
```
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
```
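
To make the role of `alpha` more concrete, here is a toy sketch of the sampling step that does not use the SentencePiece library. It assumes a hand-written list of candidate segmentations with made-up log probabilities (both hypothetical) and draws one candidate with probability proportional to `exp(alpha * log_prob)`, the smoothed distribution described in the subword regularization paper. With `nbest_size=-1` the library samples from all hypotheses rather than from an explicit list, so this only illustrates the weighting.

```python
import math
import random


def sampling_weights(log_probs, alpha):
    """Return probabilities proportional to exp(alpha * log_prob)."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_segmentation(candidates, alpha, rng=random):
    """Draw one segmentation from a list of (pieces, log_prob) pairs."""
    weights = sampling_weights([lp for _, lp in candidates], alpha)
    pieces = [p for p, _ in candidates]
    return rng.choices(pieces, weights=weights, k=1)[0]


# Hypothetical candidates for 'New York'; the log probabilities are made up.
CANDIDATES = [
    (['▁', 'New', '▁York'], -5.0),
    (['▁', 'N', 'e', 'w', '▁York'], -12.0),
    (['▁', 'New', '▁Y', 'o', 'r', 'k'], -14.0),
]

rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation(CANDIDATES, alpha=0.1, rng=rng))
```

A smaller `alpha` flattens the distribution toward uniform sampling, while a larger `alpha` concentrates it on the most probable segmentation.
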

## Installation

### Python module
SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation.
You can install the Python binary package of SentencePiece with:
```
pip install sentencepiece
```
For more details, see [Python module](python/README.md).

### Build and install SentencePiece command line tools from C++ source
The following tools and libraries are required to build SentencePiece:
* [cmake](https://cmake.org/)
* C++11 compiler
* [gperftools](https://github.com/gperftools/gperftools) library (optional; can yield a 10-40% performance improvement)

On Ubuntu, the build tools can be installed with apt-get:
```
% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```
Then, you can build and install the command line tools as follows.
```
% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v
```
On OSX/macOS, replace the last command with `sudo update_dyld_shared_cache`.
### Build and install using vcpkg
You can download and install sentencepiece using the [vcpkg](https://github.com/Microsoft/vcpkg) dependency manager:

    git clone https://github.com/Microsoft/vcpkg.git
    cd vcpkg
    ./bootstrap-vcpkg.sh
    ./vcpkg integrate install
    ./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.

### Download and install SentencePiece from signed released wheels
You can download the wheel from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest).
We generate [SLSA3 signatures](https://slsa.dev) using the OpenSSF's [slsa-framework/slsa-github-generator](https://github.com/slsa-framework/slsa-github-generator) during the release process. To verify a release binary:
1. Install the verification tool from [slsa-framework/slsa-verifier#installation](https://github.com/slsa-framework/slsa-verifier#installation).
2. Download the provenance file `attestation.intoto.jsonl` from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest).
3. Run the verifier:
```shell
slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
```
Once verification succeeds, install the wheel with `pip install wheel_file.whl`.
## Usage instructions
### Train SentencePiece Model
```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```
* `--input`: one-sentence-per-line **raw** corpus file. No need to run
  tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes
  the input with Unicode NFKC. You can pass a comma-separated list of files.
* `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
* `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000
* `--character_coverage`: fraction of characters covered by the model. Good defaults are `0.9995` for languages with a rich character set such as Japanese or Chinese, and `1.0` for other languages with a small character set.
* `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using the `word` type.

Use the `--help` flag to display all training parameters, or see [here](doc/options.md) for an overview.
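
As a rough illustration of what `--character_coverage` controls, the sketch below counts the characters in a toy corpus and keeps the most frequent ones until their cumulative share of all character occurrences reaches the requested coverage. This is a simplified reading of the option, not SentencePiece's implementation; the function name `covered_characters` and the corpus are hypothetical.

```python
from collections import Counter


def covered_characters(sentences, coverage):
    """Return the most frequent characters needed to reach `coverage`."""
    counts = Counter(ch for line in sentences for ch in line)
    total = sum(counts.values())
    kept, accumulated = [], 0
    # most_common() yields characters in descending order of frequency
    for ch, n in counts.most_common():
        if accumulated / total >= coverage:
            break  # the remaining rare characters would be left uncovered
        kept.append(ch)
        accumulated += n
    return kept


corpus = ["aaaa", "aabb", "abc"]  # toy corpus: a=7, b=3, c=1
print(covered_characters(corpus, 1.0))
print(covered_characters(corpus, 0.9))
```

With a coverage below `1.0`, the rarest characters fall outside the kept set, which is why a slightly lower value suits languages with very large character inventories.
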

### Encode raw text into sentence pieces/ids
```
% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output
```
Use the `--extra_options` flag to insert the BOS/EOS markers or reverse the input sequence.
```
% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```
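
The effect of these options on a piece sequence can be pictured with a few lines of Python. This is an illustrative sketch only: `apply_extra_options` is a hypothetical helper, not part of SentencePiece, and it assumes that the options are applied left to right in the order they are listed.

```python
def apply_extra_options(pieces, extra_options):
    """Apply a colon-separated option string to a list of pieces."""
    result = list(pieces)
    if not extra_options:
        return result
    for option in extra_options.split(':'):
        if option == 'reverse':
            result.reverse()
        elif option == 'bos':
            result.insert(0, '<s>')
        elif option == 'eos':
            result.append('</s>')
        else:
            raise ValueError(f'unknown option: {option}')
    return result


pieces = ['▁Hello', '▁world', '.']
print(apply_extra_options(pieces, 'eos'))
print(apply_extra_options(pieces, 'bos:eos'))
print(apply_extra_options(pieces, 'reverse:bos:eos'))
```

Under that ordering assumption, `reverse:bos:eos` reverses the pieces first and then wraps the reversed sequence in `<s>` and `</s>`.
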
SentencePiece supports n-best segmentation and segmentation sampling with the `--output_format=(nbest|sample)_(piece|id)` flags.
```
% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
```
### Decode sentence pieces/ids into raw text
```
% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output
```
Use the `--extra_options` flag to decode the text in reverse order.
```
% spm_decode --extra_options=reverse < input > output
```
### End-to-End Example
```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab
% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6
% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```
Note that the original input sentence is restored from the vocabulary id sequence.
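
Because whitespace is encoded as the ordinary symbol `▁` (U+2581), restoring text from pieces needs no language-specific rules. The following sketch reproduces the decoding step for the piece sequence above in plain Python, without the library. `detokenize` is a hypothetical helper name, and dropping the leading space reflects the dummy prefix that SentencePiece adds by default.

```python
def detokenize(pieces):
    """Join pieces and turn the whitespace marker back into spaces."""
    text = ''.join(pieces).replace('\u2581', ' ')
    # Drop the single leading space produced by the dummy prefix.
    return text[1:] if text.startswith(' ') else text


pieces = '▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .'.split(' ')
print(detokenize(pieces))
```
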
### Export vocabulary list
```
% spm_export_vocab --model=<model_file> --output=<output file>
```
```<output file>``` stores the vocabulary pieces and their emission log probabilities. The vocabulary id corresponds to the line number in this file.
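
A file in this format is easy to load into lookup tables. The sketch below assumes that each line holds a piece and its score separated by a tab, and that ids count lines from zero; both are assumptions about the file layout that should be checked against your own output. The sample content is invented.

```python
import io


def load_vocab(lines):
    """Map each piece to its line-based id and collect the scores."""
    piece_to_id, scores = {}, []
    for idx, line in enumerate(lines):
        # Split on the last tab so the score is always the final field.
        piece, score = line.rstrip('\n').rsplit('\t', 1)
        piece_to_id[piece] = idx
        scores.append(float(score))
    return piece_to_id, scores


# Invented sample standing in for an exported vocabulary file.
sample = io.StringIO('<unk>\t0\n<s>\t0\n</s>\t0\n▁a\t-2.5\ns\t-3.1\n')
piece_to_id, scores = load_vocab(sample)
print(piece_to_id['▁a'], scores[piece_to_id['▁a']])
```
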
### Redefine special meta tokens
By default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;/s&gt;) tokens, which have the ids 0, 1, and 2 respectively. This mapping can be redefined in the training phase as follows.

```
% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
```
Setting an id to -1, e.g., ```bos_id=-1```, disables that special token. Note that the unknown id cannot be disabled. An id for padding (&lt;pad&gt;) can be defined with ```--pad_id=3```.
If you want to assign other special tokens, please see [Use custom symbols](doc/special_symbols.md).
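
When the training command is generated from a script, the rules above can be checked before `spm_train` is invoked. The helper below is a hypothetical sketch, not part of SentencePiece: it only assembles the id flags, rejects a disabled unknown id, and, as an extra assumption of this example, requires the enabled ids to be distinct. Padding is treated as disabled (`-1`) unless an id is given.

```python
def special_token_flags(unk_id=0, bos_id=1, eos_id=2, pad_id=-1):
    """Build the id flags for spm_train after basic sanity checks."""
    if unk_id < 0:
        raise ValueError('the unknown id cannot be disabled')
    ids = {'unk_id': unk_id, 'bos_id': bos_id, 'eos_id': eos_id, 'pad_id': pad_id}
    # Ids of -1 mean "disabled" and are excluded from the uniqueness check.
    enabled = [value for value in ids.values() if value >= 0]
    if len(enabled) != len(set(enabled)):
        raise ValueError('enabled special tokens must have distinct ids')
    return [f'--{name}={value}' for name, value in ids.items()]


print(' '.join(special_token_flags(unk_id=5, bos_id=0, eos_id=1)))
```
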
### Vocabulary restriction
```spm_encode``` accepts the ```--vocabulary``` and ```--vocabulary_threshold``` options so that it only produces symbols that also appear in the vocabulary (with at least some frequency). The background of this feature is described on the [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).

The usage is basically the same as that of ```subword-nmt```. Assuming that L1 and L2 are the two languages (source/target), train a shared spm model and generate the resulting vocabulary for each language:
```
% cat {train_file}.L1 {train_file}.L2 | shuf > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
```
The ```shuf``` command is used just in case, because ```spm_train``` loads the first 10M lines of the corpus by default.
Then segment the train/test corpus with the ```--vocabulary``` option:
```
% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
```
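
Conceptually, the threshold keeps only the pieces that were seen often enough in that language's training data; `spm_encode` then segments anything else into smaller in-vocabulary pieces. The sketch below shows only the filtering half. It assumes that the generated vocabulary file holds one tab-separated piece and frequency per line (an assumption about the file layout), and the sample data is invented.

```python
def allowed_pieces(vocab_lines, threshold):
    """Return the pieces whose frequency is at least `threshold`."""
    allowed = set()
    for line in vocab_lines:
        piece, freq = line.rstrip('\n').rsplit('\t', 1)
        if int(freq) >= threshold:
            allowed.add(piece)
    return allowed


# Invented sample standing in for {vocab_file}.L1
sample = ['▁the\t1200', '▁tele\t75', 'scope\t49', '▁girl\t50']
print(sorted(allowed_pieces(sample, threshold=50)))
```
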
## Advanced topics
* [SentencePiece Experiments](doc/experiments.md)
* [SentencePieceProcessor C++ API](doc/api.md)
* [Use custom text normalization rules](doc/normalization.md)
* [Use custom symbols](doc/special_symbols.md)
* [Python Module](python/README.md)
* [Segmentation and training algorithms in detail]