mirror of
https://github.com/hasktorch/tokenizers.git
synced 2024-10-26 06:00:08 +03:00
1.5 KiB
1.5 KiB
Tokenizers
Provides an implementation of today's most used tokenizers with a focus on performances and versatility. The goal is to make it as easy as possible to construct a Tokenizer, learn a vocabulary, and then process some text either in real time or in advance.
What is a Tokenizer
A Tokenizer works as a pipeline taking some raw text as input, going through multiple steps to
finally output a list of Token
s. The various steps of the pipeline are:
- Some optional
Normalizer
s. An example would be a Unicode normalization step. They take some raw text as input, and also output raw textString
. - An optional
PreTokenizer
which should take some raw text and take care of spliting as relevant, and pre-processing tokens if needed. Takes a raw textString
as input, and outputs aVec<String>
. - A
Model
to do the actual tokenization. An example ofModel
would beBPE
. Takes aVec<String>
as input, and gives aVec<Token>
. - Some optional
PostProcessor
s. These are in charge of post processing the list ofToken
s in any relevant way. This includes truncating, adding some padding, ...
Try the shell
You can try a simple ByteLevel BPE Tokenizer by using the following command. This expects
vocab.json
and merges.txt
files, trained with ByteLevel BPE.
cd tokenizers
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cargo run --release shell --vocab gpt2-vocab.json --merges gpt2-merges.txt