mirror of
https://github.com/google/sentencepiece.git
synced 2024-10-26 11:38:45 +03:00
Removes the claim that Korean has no spaces
Removes the claim that Korean has no spaces, since Korean does use word spacing. [Reference](http://www.koreanwikiproject.com/wiki/Word_spacing)
This commit is contained in:
parent
bca47c0eb9
commit
bd0ea9b196
@@ -60,7 +60,7 @@ The number of merge operations is a BPE-specific parameter and not applicable to
 #### Trains from raw sentences
 Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance.
 
-The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese, Japanese and Korean where no explicit spaces exist between words.
+The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.
 
 #### Whitespace is treated as a basic symbol
 The first step of Natural Language processing is text tokenization. For