Creating Quantized Models

The point of quantizing models is to allow faster, approximate computation by replacing floating-point general matrix multiplication with integer GEMM (IntGEMM). We distinguish IntGEMM16 (16-bit quantization) and IntGEMM8 (8-bit quantization), I16 and I8 for short. Quantization is particularly relevant for faster inference on CPUs. The starting point in both cases is a trained model with 32-bit floating-point (F32) parameters.
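As a rough illustration of what quantization does (the exact scheme used by the IntGEMM backends is not spelled out here, so treat the formula as an assumption), an F32 weight matrix W can be mapped to 8-bit integers with a per-matrix scale derived from its largest absolute value:

$$
s = \frac{127}{\max_{i,j} |W_{ij}|}, \qquad \widehat{W}_{ij} = \mathrm{round}\!\left(s\,W_{ij}\right), \qquad W_{ij} \approx \widehat{W}_{ij}/s .
$$

Matrix products are then computed on the integer values and rescaled afterwards. I16 works the same way with a much larger integer range, which is why its quality loss is smaller.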

IntGEMM16 (I16)

I16 leads to only small losses in translation quality, so fine-tuning of the quantized model is normally not required. To quantize an F32 model into I16, run

TO BE ADDED!
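A rough sketch of the conversion with marian-conv, assuming a marian build with intgemm support; the flag names and the intgemm16 gemm-type value are assumptions and may differ between marian versions:

    # Sketch (assumed flags): convert an F32 model to a 16-bit intgemm model.
    # -f/--from: input F32 model, -t/--to: output binary model, -g/--gemm-type: target GEMM type.
    ./marian-conv -f model.npz -t model.intgemm16.bin -g intgemm16

The resulting binary model is then loaded by marian-decoder like a regular model.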

IntGEMM8 (I8)

8-bit quantization without fine-tuning of the quantized model may lead to a loss in translation quality. It is therefore recommended to fine-tune the model. To do so, we proceed in three steps (a rough sketch of the whole procedure follows after the steps):

1. Create a 'fake' 8-bit model with

TO BE ADDED!

2. Fine-tune this model on the original training data

TO BE ADDED!

3. Quantize the fine-tuned model to I8

TO BE ADDED!
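A rough end-to-end sketch of one way to realise these steps, assuming a marian build with quantization-aware training (--quantize-bits) and intgemm support. In this sketch, creating the 'fake' 8-bit model and fine-tuning collapse into a single quantization-aware training run, and all file names, option names, and values are assumptions rather than the definitive recipe:

    # Sketch (assumed options): steps 1+2 -- continue training on the original data with
    # 8-bit quantization-aware training, so the saved F32 parameters stay on an 8-bit grid
    # (a 'fake' 8-bit model). finetune.yml holds the usual training settings (vocabs, etc.).
    cp model.npz model.finetuned.npz
    ./marian -c finetune.yml -m model.finetuned.npz -t corpus.src corpus.trg --quantize-bits 8

    # Sketch (assumed flags): step 3 -- convert the fine-tuned F32 model to a real 8-bit intgemm model.
    ./marian-conv -f model.finetuned.npz -t model.intgemm8.bin -g intgemm8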

Memory footprint of quantized models

Memory footprint increases nearly linearly with the number of threads; all figures below are for a single thread.

tiny11, 1 thread

Mini-batch 32, shortlist: 227 MB
Mini-batch 32, no shortlist: 181 MB
Mini-batch 1, shortlist: 177 MB
Mini-batch 1, no shortlist: 132 MB

base, 1 thread

Mini-batch 32, shortlist: 423 MB
Mini-batch 32, no shortlist: 377 MB
Mini-batch 1, shortlist: 287 MB
Mini-batch 1, no shortlist: 242 MB
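For reference, the "Mini-batch 32, shortlist" configuration corresponds roughly to a decoder invocation like the one below; all file names are placeholders and the exact --shortlist arguments depend on the marian version:

    # Sketch (placeholder file names): decode on 1 CPU thread with mini-batch 32 and a lexical shortlist.
    ./marian-decoder -m model.intgemm8.bin -v vocab.spm vocab.spm \
        --cpu-threads 1 --mini-batch 32 --shortlist lex.s2t.gz 50 50 < input.txt > output.txt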