The point of quantizing models is to allow faster (approximate) computation through integer arithmetic for general matrix multiplication, a.k.a. IntGEMM. We distinguish IntGEMM16 (16-bit quantization) and IntGEMM8 (8-bit quantization), I16 and I8 for short. Quantization is particularly relevant for faster inference on CPUs. The starting point in both cases is a trained model with 32-bit float (F32) parameters.
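To illustrate the idea (this is not the library's actual code), here is a minimal NumPy sketch of 8-bit integer GEMM: both operands are mapped onto a signed integer grid with a per-tensor scale, multiplied with integer accumulation, and the result is rescaled back to the F32 range.

```python
import numpy as np

def quantize(x, bits=8):
    """Map an F32 tensor onto a signed integer grid using one per-tensor scale."""
    max_int = 2 ** (bits - 1) - 1            # 127 for I8, 32767 for I16
    scale = max_int / np.abs(x).max()
    q = np.clip(np.round(x * scale), -max_int, max_int)
    return q.astype(np.int32), scale

A = np.random.randn(4, 8).astype(np.float32)   # e.g. activations
B = np.random.randn(8, 3).astype(np.float32)   # e.g. weights
qA, sA = quantize(A)
qB, sB = quantize(B)

C_int = qA @ qB                 # integer multiply-accumulate: the fast part
C = C_int / (sA * sB)           # rescale back to the F32 range
print(np.abs(C - A @ B).max())  # small approximation error
```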
IntGEMM16 (I16)
I16 typically leads to only small losses in translation quality. Fine-tuning of the quantized model is normally not required.
To quantize an F32 model into I16, run
TO BE ADDED!
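The exact command still needs to be added here. As an illustration only (not the actual tool invocation), the conversion amounts to mapping every F32 parameter tensor onto the 16-bit integer grid with a per-tensor scale; the parameter names below are hypothetical:

```python
import numpy as np

def to_int16(w):
    """Quantize one F32 parameter tensor to I16, returning values and scale."""
    scale = 32767.0 / np.abs(w).max()
    return np.round(w * scale).astype(np.int16), np.float32(scale)

# 'model' stands for the F32 parameters; the layout is purely illustrative.
model = {"encoder_l1_Wq": np.random.randn(256, 256).astype(np.float32)}
i16_model = {name: to_int16(w) for name, w in model.items()}
```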
IntGEMM8 (I8)
8-bit quantization without fine-tuning of the quantized model may lead to a loss in translation quality. It is therefore recommended to fine-tune the model. To do so, we proceed in three steps:
Create a 'fake' 8-bit model with
TO BE ADDED!
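The command is still to be added. Conceptually (an assumption about what 'fake' means here, in line with common quantization-aware training), a fake 8-bit model keeps its parameters in F32 but snaps them to values representable on the 8-bit grid, i.e. it quantizes and immediately dequantizes, so the model can still be fine-tuned with ordinary F32 training:

```python
import numpy as np

def fake_quantize_8bit(w):
    """Round an F32 tensor to the nearest point on the 8-bit grid,
    but keep the result in F32 so normal training still works."""
    scale = 127.0 / np.abs(w).max()
    return (np.round(w * scale) / scale).astype(np.float32)
```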
Fine-tune this model on the original training data
TO BE ADDED!
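The training command is still to be added. Schematically, the fine-tuning is quantization-aware: the weights are snapped back to the 8-bit grid during training so the model learns to compensate for the rounding error. The toy example below is a sketch of that general technique, not of the actual trainer; it fits a single linear layer and re-applies the fake quantization after every update:

```python
import numpy as np

def fake_quantize_8bit(w):
    scale = 127.0 / max(np.abs(w).max(), 1e-8)
    return np.round(w * scale) / scale

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))          # stand-in for the original training data
Y = X @ rng.normal(size=(16, 4))        # targets from a "true" linear mapping

W = fake_quantize_8bit(rng.normal(size=(16, 4)))   # start from the fake 8-bit model
for step in range(200):
    grad = X.T @ (X @ W - Y) / len(X)              # gradient of the mean squared error
    W = fake_quantize_8bit(W - 0.05 * grad)        # SGD step, then re-quantize
```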
Quantize the fine-tuned model to I8
TO BE ADDED!
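Again, the exact command is still to be added. The conversion mirrors the I16 sketch above, only with the 8-bit grid (illustrative code, not the actual tool):

```python
import numpy as np

def to_int8(w):
    """Quantize one F32 parameter tensor to I8, returning values and scale."""
    scale = 127.0 / np.abs(w).max()
    return np.round(w * scale).astype(np.int8), np.float32(scale)
```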
The memory footprint increases nearly linearly with the number of threads; the figures below are for decoding with a single thread.
tiny11, 1 thread

| Mini-batch 32, shortlist | Mini-batch 32, no shortlist | Mini-batch 1, shortlist | Mini-batch 1, no shortlist |
|---|---|---|---|
| 227 MB | 181 MB | 177 MB | 132 MB |
base, 1 thread

| Mini-batch 32, shortlist | Mini-batch 32, no shortlist | Mini-batch 1, shortlist | Mini-batch 1, no shortlist |
|---|---|---|---|
| 423 MB | 377 MB | 287 MB | 242 MB |