💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. The goal is to make it as easy as possible to construct a Tokenizer, learn a vocabulary, and then process some text, either in real time or in advance.

What is a Tokenizer

A Tokenizer works as a pipeline: it takes some raw text as input and runs it through multiple steps to finally output a list of Tokens. The steps of the pipeline are:

  • Some optional Normalizers. An example would be a Unicode normalization step. They take a raw text String as input and output a raw text String.
  • An optional PreTokenizer, which takes some raw text, takes care of splitting it as relevant, and pre-processes tokens if needed. It takes a raw text String as input and outputs a Vec<String>.
  • A Model that does the actual tokenization. An example of a Model would be BPE. It takes a Vec<String> as input and produces a Vec<Token>.
  • Some optional PostProcessors. These are in charge of post-processing the list of Tokens in any relevant way, such as truncating or adding padding (see the sketch after this list).
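
To make the flow concrete, here is a minimal sketch of such a pipeline in Rust. The names below (Lowercase, WhitespaceSplit, VocabLookup, Truncate) are illustrative placeholders, not this crate's actual API: a real Model would be something like BPE rather than a plain vocabulary lookup. The point is only the shape of the data as it moves through the stages: String → Vec<String> → Vec<Token> → Vec<Token>.

use std::collections::HashMap;

// Hypothetical types for illustration only; not the crate's actual API.
#[derive(Debug)]
struct Token {
    id: u32,
    value: String,
}

trait Normalizer {
    // Raw text in, raw text out (e.g. Unicode normalization, lowercasing).
    fn normalize(&self, text: String) -> String;
}

trait PreTokenizer {
    // Raw text in, word-level pieces out.
    fn pre_tokenize(&self, text: &str) -> Vec<String>;
}

trait Model {
    // Pieces in, tokens out (this is where e.g. BPE would run).
    fn tokenize(&self, pieces: Vec<String>) -> Vec<Token>;
}

trait PostProcessor {
    // Token list in, token list out (truncation, padding, ...).
    fn process(&self, tokens: Vec<Token>) -> Vec<Token>;
}

// Toy implementations, only to make the sketch runnable end to end.
struct Lowercase;
impl Normalizer for Lowercase {
    fn normalize(&self, text: String) -> String {
        text.to_lowercase()
    }
}

struct WhitespaceSplit;
impl PreTokenizer for WhitespaceSplit {
    fn pre_tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(|s| s.to_string()).collect()
    }
}

struct VocabLookup {
    vocab: HashMap<String, u32>,
    unk_id: u32,
}
impl Model for VocabLookup {
    fn tokenize(&self, pieces: Vec<String>) -> Vec<Token> {
        pieces
            .into_iter()
            .map(|value| Token {
                id: *self.vocab.get(&value).unwrap_or(&self.unk_id),
                value,
            })
            .collect()
    }
}

struct Truncate(usize);
impl PostProcessor for Truncate {
    fn process(&self, mut tokens: Vec<Token>) -> Vec<Token> {
        tokens.truncate(self.0);
        tokens
    }
}

fn main() {
    let vocab: HashMap<String, u32> =
        vec![("hello".to_string(), 0), ("world".to_string(), 1)]
            .into_iter()
            .collect();

    let normalizer = Lowercase;
    let pre_tokenizer = WhitespaceSplit;
    let model = VocabLookup { vocab, unk_id: 2 };
    let post_processor = Truncate(128);

    // The four stages run in the order listed above.
    let text = normalizer.normalize("Hello World !".to_string()); // 1. normalize
    let pieces = pre_tokenizer.pre_tokenize(&text);               // 2. pre-tokenize
    let tokens = model.tokenize(pieces);                          // 3. model
    let tokens = post_processor.process(tokens);                  // 4. post-process

    println!("{:?}", tokens);
}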

Try the shell

You can try a simple ByteLevel BPE Tokenizer with the following commands. They expect vocab.json and merges.txt files trained with ByteLevel BPE.

cd tokenizers
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cargo run --release shell --vocab gpt2-vocab.json --merges gpt2-merges.txt
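
If you are curious about what the two downloaded files contain, here is a small sketch that peeks into them. It assumes the serde_json crate is available and that it runs from the directory holding the files; vocab.json maps token strings to ids, and merges.txt lists the learned BPE merge rules, one pair per line.

use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // vocab.json: {"!": 0, "\"": 1, ...} mapping each token string to its id.
    let vocab: std::collections::HashMap<String, u32> =
        serde_json::from_str(&fs::read_to_string("gpt2-vocab.json")?)?;
    println!("vocabulary size: {}", vocab.len());

    // merges.txt: one "left right" pair per line, in the order the merges
    // were learned (the first line may be a version header).
    let merges = fs::read_to_string("gpt2-merges.txt")?;
    for rule in merges.lines().filter(|l| !l.starts_with('#')).take(5) {
        println!("merge rule: {}", rule);
    }
    Ok(())
}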