mirror of
https://github.com/google/sentencepiece.git
synced 2024-12-29 11:11:58 +03:00
Fix link to nfkc.tsv in normalization.md
parent 995460307e
commit cefb97b59d
@@ -34,7 +34,7 @@ You can use custom normalization rule by preparing a TSV file formatted as follo
```
In this sample, UCS4 sequence [41 302 300] (hex) is converted into [1EA6] (hex). When there are ambiguities in the conversions, the longest rule is used.
Note that the tab is used as a delimiter for source and target sequence and space is used as a delimiter for UCS4 characters. We can make the target sequence empty to remove some specific characters from the text.
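The rule format described above can be illustrated with a small sketch. It is not SentencePiece's actual implementation. The helper names `parse_rules` and `normalize` are hypothetical, and the second and third sample rules are added only to show longest-match and empty-target behavior.

```python
def parse_rules(tsv_text):
    """Parse lines of '<source hex...> TAB <target hex...>' into a dict of code point tuples."""
    rules = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        src, _, tgt = line.partition("\t")
        src_key = tuple(int(h, 16) for h in src.split())
        # An empty target yields an empty tuple, which removes the source sequence.
        rules[src_key] = tuple(int(h, 16) for h in tgt.split())
    return rules


def normalize(text, rules):
    """Scan left to right, always preferring the longest matching source sequence."""
    cps = [ord(c) for c in text]
    max_len = max((len(k) for k in rules), default=0)
    out = []
    i = 0
    while i < len(cps):
        for n in range(min(max_len, len(cps) - i), 0, -1):
            key = tuple(cps[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            out.append(cps[i])  # no rule matched: copy the code point unchanged
            i += 1
    return "".join(chr(c) for c in out)


# First rule is the sample from the text; the other two are illustrative assumptions.
SAMPLE = "41 302 300\t1EA6\n41 302\tC2\nAD\t\n"
RULES = parse_rules(SAMPLE)
```

With these rules, `normalize("A\u0302\u0300", RULES)` uses the three-code-point rule rather than the shorter `41 302` rule, and the soft hyphen (U+00AD) is dropped because its target is empty.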
-See [data/nfkc.tsv](data/nfkc.tsv) as an example. Once a TSV file is prepared, you can specify it with `--normalization_rule_tsv` flag.
+See [data/nfkc.tsv](../data/nfkc.tsv) as an example. Once a TSV file is prepared, you can specify it with `--normalization_rule_tsv` flag.
```
% spm_train --normalization_rule_tsv=<rule tsv file> --input=<input> --model_prefix=<model file> --vocab_size=8000
```
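One way to prepare such a rule file is to generate it from string pairs. The sketch below assumes uppercase hex without padding, as in the sample rule. `rule_line` and `write_rule_tsv` are hypothetical helpers, not part of SentencePiece.

```python
def rule_line(source, target):
    """Format one rule: space-separated hex code points, tab between source and target."""
    def to_hex(s):
        return " ".join(format(ord(c), "X") for c in s)
    return to_hex(source) + "\t" + to_hex(target)


def write_rule_tsv(path, mapping):
    """Write one rule per line; an empty target string removes the source sequence."""
    with open(path, "w", encoding="utf-8") as f:
        for source, target in mapping.items():
            f.write(rule_line(source, target) + "\n")
```

For example, `write_rule_tsv("rules.tsv", {"A\u0302\u0300": "\u1EA6"})` writes the line `41 302 300<TAB>1EA6`, and the resulting path can then be given to `--normalization_rule_tsv`.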