Creating Quantized Models

The point of quantizing models is to allow faster, approximate computation by replacing floating-point general matrix multiplication with integer GEMM (IntGEMM). We distinguish IntGEMM16 (16-bit quantization) and IntGEMM8 (8-bit quantization), I16 and I8 for short. Quantization is particularly relevant for faster inference on CPUs. The starting point in both cases is a trained model with 32-bit floating-point (F32) parameters.
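As a rough illustration of what quantization does (the exact scheme used by the IntGEMM backends is not spelled out here, so treat the formula as an assumption), an F32 weight matrix W can be mapped to 8-bit integers with a per-matrix scale derived from its largest absolute value:

$$
s = \frac{127}{\max_{i,j} |W_{ij}|}, \qquad \widehat{W}_{ij} = \mathrm{round}\!\left(s\,W_{ij}\right), \qquad W_{ij} \approx \widehat{W}_{ij}/s .
$$

Matrix products are then computed on the integer values and rescaled afterwards. I16 works the same way with a much larger integer range, which is why its quality loss is smaller.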

IntGEMM16 (I16)

I16 leads to only small losses in translation quality, so fine-tuning of the quantized model is normally not required. To quantize an F32 model into I16, run

TO BE ADDED!
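A rough sketch of the conversion with marian-conv, assuming a marian build with intgemm support; the flag names and the intgemm16 gemm-type value are assumptions and may differ between marian versions:

    # Sketch (assumed flags): convert an F32 model to a 16-bit intgemm model.
    # -f/--from: input F32 model, -t/--to: output binary model, -g/--gemm-type: target GEMM type.
    ./marian-conv -f model.npz -t model.intgemm16.bin -g intgemm16

The resulting binary model is then loaded by marian-decoder like a regular model.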

IntGEMM8 (I8)

8-bit quantization without fine-tuning of the quantized model may lead to a loss in translation quality. It is therefore recommended to fine-tune the model. To do so, we proceed in three steps (a rough sketch of the whole procedure follows after the steps):

1. Create a 'fake' 8-bit model with

TO BE ADDED!

2. Fine-tune this model on the original training data

TO BE ADDED!

3. Quantize the fine-tuned model to I8

TO BE ADDED!
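A rough end-to-end sketch of one way to realise these steps, assuming a marian build with quantization-aware training (--quantize-bits) and intgemm support. In this sketch, creating the 'fake' 8-bit model and fine-tuning collapse into a single quantization-aware training run, and all file names, option names, and values are assumptions rather than the definitive recipe:

    # Sketch (assumed options): steps 1+2 -- continue training on the original data with
    # 8-bit quantization-aware training, so the saved F32 parameters stay on an 8-bit grid
    # (a 'fake' 8-bit model). finetune.yml holds the usual training settings (vocabs, etc.).
    cp model.npz model.finetuned.npz
    ./marian -c finetune.yml -m model.finetuned.npz -t corpus.src corpus.trg --quantize-bits 8

    # Sketch (assumed flags): step 3 -- convert the fine-tuned F32 model to a real 8-bit intgemm model.
    ./marian-conv -f model.finetuned.npz -t model.intgemm8.bin -g intgemm8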

Memory footprint of quantized models

Memory footprint increases nearly linearly with the number of threads; all figures below are for a single thread.

tiny11, 1 thread

Mini-batch 32, shortlist: 227 MB
Mini-batch 32, no shortlist: 181 MB
Mini-batch 1, shortlist: 177 MB
Mini-batch 1, no shortlist: 132 MB

base, 1 thread

Mini-batch 32, shortlist: 423 MB
Mini-batch 32, no shortlist: 377 MB
Mini-batch 1, shortlist: 287 MB
Mini-batch 1, no shortlist: 242 MB
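For reference, the "Mini-batch 32, shortlist" configuration corresponds roughly to a decoder invocation like the one below; all file names are placeholders and the exact --shortlist arguments depend on the marian version:

    # Sketch (placeholder file names): decode on 1 CPU thread with mini-batch 32 and a lexical shortlist.
    ./marian-decoder -m model.intgemm8.bin -v vocab.spm vocab.spm \
        --cpu-threads 1 --mini-batch 32 --shortlist lex.s2t.gz 50 50 < input.txt > output.txt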