💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. The goal is to make it as easy as possible to construct a Tokenizer, learn a vocabulary, and then process some text, either in real time or in advance.

What is a Tokenizer

A Tokenizer works as a pipeline: it takes some raw text as input and runs it through multiple steps to finally output a list of Tokens. The steps of the pipeline are:

  • Some optional Normalizers. An example would be a Unicode normalization step. They take a raw text String as input and output a raw text String.
  • An optional PreTokenizer, which takes some raw text, takes care of splitting it as relevant, and pre-processes tokens if needed. It takes a raw text String as input and outputs a Vec<String>.
  • A Model that does the actual tokenization. An example of a Model would be BPE. It takes a Vec<String> as input and produces a Vec<Token>.
  • Some optional PostProcessors. These are in charge of post-processing the list of Tokens in any relevant way, such as truncating or adding padding (see the sketch after this list).
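
To make the flow concrete, here is a minimal sketch of such a pipeline in Rust. The names below (Lowercase, WhitespaceSplit, VocabLookup, Truncate) are illustrative placeholders, not this crate's actual API: a real Model would be something like BPE rather than a plain vocabulary lookup. The point is only the shape of the data as it moves through the stages: String → Vec<String> → Vec<Token> → Vec<Token>.

use std::collections::HashMap;

// Hypothetical types for illustration only; not the crate's actual API.
#[derive(Debug)]
struct Token {
    id: u32,
    value: String,
}

trait Normalizer {
    // Raw text in, raw text out (e.g. Unicode normalization, lowercasing).
    fn normalize(&self, text: String) -> String;
}

trait PreTokenizer {
    // Raw text in, word-level pieces out.
    fn pre_tokenize(&self, text: &str) -> Vec<String>;
}

trait Model {
    // Pieces in, tokens out (this is where e.g. BPE would run).
    fn tokenize(&self, pieces: Vec<String>) -> Vec<Token>;
}

trait PostProcessor {
    // Token list in, token list out (truncation, padding, ...).
    fn process(&self, tokens: Vec<Token>) -> Vec<Token>;
}

// Toy implementations, only to make the sketch runnable end to end.
struct Lowercase;
impl Normalizer for Lowercase {
    fn normalize(&self, text: String) -> String {
        text.to_lowercase()
    }
}

struct WhitespaceSplit;
impl PreTokenizer for WhitespaceSplit {
    fn pre_tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(|s| s.to_string()).collect()
    }
}

struct VocabLookup {
    vocab: HashMap<String, u32>,
    unk_id: u32,
}
impl Model for VocabLookup {
    fn tokenize(&self, pieces: Vec<String>) -> Vec<Token> {
        pieces
            .into_iter()
            .map(|value| Token {
                id: *self.vocab.get(&value).unwrap_or(&self.unk_id),
                value,
            })
            .collect()
    }
}

struct Truncate(usize);
impl PostProcessor for Truncate {
    fn process(&self, mut tokens: Vec<Token>) -> Vec<Token> {
        tokens.truncate(self.0);
        tokens
    }
}

fn main() {
    let vocab: HashMap<String, u32> =
        vec![("hello".to_string(), 0), ("world".to_string(), 1)]
            .into_iter()
            .collect();

    let normalizer = Lowercase;
    let pre_tokenizer = WhitespaceSplit;
    let model = VocabLookup { vocab, unk_id: 2 };
    let post_processor = Truncate(128);

    // The four stages run in the order listed above.
    let text = normalizer.normalize("Hello World !".to_string()); // 1. normalize
    let pieces = pre_tokenizer.pre_tokenize(&text);               // 2. pre-tokenize
    let tokens = model.tokenize(pieces);                          // 3. model
    let tokens = post_processor.process(tokens);                  // 4. post-process

    println!("{:?}", tokens);
}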

Try the shell

You can try a simple ByteLevel BPE Tokenizer with the following commands. They expect vocab.json and merges.txt files trained with ByteLevel BPE.

cd tokenizers
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cargo run --release shell --vocab gpt2-vocab.json --merges gpt2-merges.txt
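
If you are curious about what the two downloaded files contain, here is a small sketch that peeks into them. It assumes the serde_json crate is available and that it runs from the directory holding the files; vocab.json maps token strings to ids, and merges.txt lists the learned BPE merge rules, one pair per line.

use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // vocab.json: {"!": 0, "\"": 1, ...} mapping each token string to its id.
    let vocab: std::collections::HashMap<String, u32> =
        serde_json::from_str(&fs::read_to_string("gpt2-vocab.json")?)?;
    println!("vocabulary size: {}", vocab.len());

    // merges.txt: one "left right" pair per line, in the order the merges
    // were learned (the first line may be a version header).
    let merges = fs::read_to_string("gpt2-merges.txt")?;
    for rule in merges.lines().filter(|l| !l.starts_with('#')).take(5) {
        println!("merge rule: {}", rule);
    }
    Ok(())
}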