Fix readme indentation

This commit is contained in:
MOI Anthony 2019-11-18 16:34:13 -05:00 committed by GitHub
parent 1b32560067
commit 2d7c5f04f8

@@ -8,15 +8,15 @@ vocabulary, and then process some text either in real time or in advance.
A Tokenizer works as a pipeline taking some raw text as input, going through multiple steps to
finally output a list of `Token`s. The various steps of the pipeline are:
- Some optional `Normalizer`s. An example would be a Unicode normalization step. They take
some raw text as input, and also output raw text `String`.
- An optional `PreTokenizer` which should take some raw text and take care of splitting
it as relevant, and pre-processing tokens if needed. Takes a raw text `String` as input, and
outputs a `Vec<String>`.
- A `Model` to do the actual tokenization. An example of `Model` would be `BPE`. Takes
a `Vec<String>` as input, and gives a `Vec<Token>`.
- Some optional `PostProcessor`s. These are in charge of post-processing the list of `Token`s
in any relevant way. This includes truncating, adding some padding, etc.
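The pipeline above can be sketched in Rust. Note this is an illustrative toy, not the crate's actual API: the trait names mirror the steps in the list, but the signatures and the `Lowercase`, `Whitespace`, and `DummyModel` implementations are simplified assumptions for demonstration.

```rust
// A simplified `Token` produced by the `Model` step.
#[derive(Debug, PartialEq)]
struct Token {
    id: u32,
    value: String,
}

// Each pipeline step as a trait, following the README's description:
// raw text -> Normalizer -> PreTokenizer -> Model -> PostProcessor.
trait Normalizer {
    fn normalize(&self, text: String) -> String;
}

trait PreTokenizer {
    fn pre_tokenize(&self, text: String) -> Vec<String>;
}

trait Model {
    fn tokenize(&self, words: Vec<String>) -> Vec<Token>;
}

trait PostProcessor {
    fn process(&self, tokens: Vec<Token>) -> Vec<Token>;
}

// Toy implementations wiring the steps together.
struct Lowercase;
impl Normalizer for Lowercase {
    fn normalize(&self, text: String) -> String {
        text.to_lowercase()
    }
}

struct Whitespace;
impl PreTokenizer for Whitespace {
    fn pre_tokenize(&self, text: String) -> Vec<String> {
        text.split_whitespace().map(String::from).collect()
    }
}

struct DummyModel;
impl Model for DummyModel {
    fn tokenize(&self, words: Vec<String>) -> Vec<Token> {
        // Assign sequential ids; a real `Model` like `BPE` would look up
        // sub-word units in a learned vocabulary instead.
        words
            .into_iter()
            .enumerate()
            .map(|(i, value)| Token { id: i as u32, value })
            .collect()
    }
}

fn main() {
    let text = "Hello Tokenizer World".to_string();
    let normalized = Lowercase.normalize(text);
    let words = Whitespace.pre_tokenize(normalized);
    let tokens = DummyModel.tokenize(words);
    for t in &tokens {
        println!("{} -> {}", t.id, t.value);
    }
}
```

Each step only consumes the previous step's output type, which is what lets the real library swap implementations (different normalizers, pre-tokenizers, or models) without changing the surrounding pipeline.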
## Try the shell