Merge pull request #1012 from Cassini-chris/patch-2

adding vocab_size consistency
Taku Kudo 2024-05-22 15:38:53 +09:00 committed by GitHub
commit 58b550871d

@@ -250,7 +250,7 @@
"- **user defined symbols**: Always treated as one token in any context. These symbols can appear in the input sentence. \n",
"- **control symbol**: We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. User needs to insert ids explicitly after encoding.\n",
"\n",
"For experimental purpose, user defined symbols are easier to use since user can change the behavior just by modifying the input text. However, we want to use control symbols in the production setting in order to avoid users from tweaking the behavior by feeding these special symbols in their input text."
"For experimental purposes, user defined symbols are easier to use since user can change the behavior just by modifying the input text. However, we want to use control symbols in the production setting in order to avoid users from tweaking the behavior by feeding these special symbols in their input text."
]
},
{
@@ -273,7 +273,7 @@
"\n",
"# ids are reserved in both mode.\n",
"# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4\n",
"# user defined symbols allow these symbol to apper in the text.\n",
"# user defined symbols allow these symbols to appear in the text.\n",
"print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))\n",
"print(sp_user.piece_to_id('<sep>')) # 3\n",
"print(sp_user.piece_to_id('<cls>')) # 4\n",
@@ -605,7 +605,7 @@
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')\n",
"\n",
"# Can obtain different segmentations per request.\n",
"# There are two hyperparamenters for sampling (nbest_size and inverse temperature). see the paper [kudo18] for detail.\n",
"# There are two hyperparameters for sampling (nbest_size and inverse temperature). see the paper [kudo18] for detail.\n",
"for n in range(10):\n",
" print(sp.sample_encode_as_pieces('hello world', -1, 0.1))\n",
"\n",
@@ -760,7 +760,7 @@
"Sentencepiece supports character and word segmentation with **--model_type=char** and **--model_type=character** flags.\n",
"\n",
"In `word` segmentation, sentencepiece just segments tokens with whitespaces, so the input text must be pre-tokenized.\n",
"We can apply different segmentation algorithm transparently without changing pre/post processors."
"We can apply different segmentation algorithms transparently without changing pre/post processors."
]
},
{
@@ -775,7 +775,7 @@
},
"cell_type": "code",
"source": [
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')\n",
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=2000')\n",
"\n",
"sp_char = spm.SentencePieceProcessor()\n",
"sp_char.load('m_char.model')\n",
@@ -884,7 +884,7 @@
"cell_type": "markdown",
"source": [
"The normalization is performed with user-defined string-to-string mappings and leftmost longest matching.\n",
"We can also define the custom normalization rules as TSV file. The TSV files for pre-defined normalziation rules can be found in the data directory ([sample](https://raw.githubusercontent.com/google/sentencepiece/master/data/nfkc.tsv)). The normalization rule is compiled into FST and embedded in the model file. We don't need to specify the normalization configuration in the segmentation phase.\n",
"We can also define the custom normalization rules as TSV file. The TSV files for pre-defined normalization rules can be found in the data directory ([sample](https://raw.githubusercontent.com/google/sentencepiece/master/data/nfkc.tsv)). The normalization rule is compiled into FST and embedded in the model file. We don't need to specify the normalization configuration in the segmentation phase.\n",
"\n",
"Here's the example of custom normalization. The TSV file is fed with **--normalization_rule_tsv=&lt;FILE&gt;** flag."
]
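A minimal sketch of writing such a rule file follows. The file name normalization_rule.tsv and the helper name are illustrative; as in the bundled nfkc.tsv, each line holds the source Unicode code points in hex separated by spaces, a tab, and then the target code points:

def to_hex_codepoints(s):
    # e.g. "I'm" -> "49 27 6d"
    return ' '.join(format(ord(c), 'x') for c in s)

with open('normalization_rule.tsv', 'w') as f:
    f.write(to_hex_codepoints("I'm") + '\t' + to_hex_codepoints('I am') + '\n')
    f.write(to_hex_codepoints("don't") + '\t' + to_hex_codepoints('do not') + '\n')

# The rule file is compiled into the model at training time.
spm.SentencePieceTrainer.train(
    '--input=botchan.txt --model_prefix=m --vocab_size=2000 '
    '--normalization_rule_tsv=normalization_rule.tsv')
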
@@ -921,7 +921,7 @@
"sp = spm.SentencePieceProcessor()\n",
"# m.model embeds the normalization rule compiled into an FST.\n",
"sp.load('m.model')\n",
"print(sp.encode_as_pieces(\"I'm busy\")) # normalzied to `I am busy'\n",
"print(sp.encode_as_pieces(\"I'm busy\")) # normalized to `I am busy'\n",
"print(sp.encode_as_pieces(\"I don't know it.\")) # normalized to 'I do not know it.'"
],
"execution_count": 0,
@@ -995,7 +995,7 @@
"source": [
"## Vocabulary restriction\n",
"\n",
"We can encode the text only using the tokens spececified with **set_vocabulary** method. The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt)."
"We can encode the text only using the tokens specified with **set_vocabulary** method. The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt)."
]
},
{
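A brief sketch of that restriction (the piece list below is purely illustrative; in practice it would come from a frequency-filtered vocabulary as in the subword-nmt recipe):

sp = spm.SentencePieceProcessor()
sp.load('m.model')

print(sp.encode_as_pieces('this is a test.'))   # unrestricted segmentation

# Restrict encoding to the listed pieces; other words fall back to finer-grained pieces.
sp.set_vocabulary(['<s>', '</s>', '▁this', '▁is', '▁a', '▁t', 'e', 's', 't', '.'])
print(sp.encode_as_pieces('this is a test.'))

sp.reset_vocabulary()   # lift the restriction
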