Quantization


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, TinyML


Overview

Introduction

  • by far the most general model optimization method. Quantization reduces a model’s size by using fewer bits to represent its parameters, e.g. instead of using 32 bits to represent a float, use only 16 bits, or even 4 bits.

  • QLoRA uses a combination of LoRA (Low-Rank Adaptation of Large Language Models) and quantization

Overall quantization process: see quantization_flow_chart (embedded flow chart)

For Static Quantization

  • requires some form of calibration / preprocessing

For Dynamic Quantization

  • activation ranges are inferred during runtime, which could lead to slower inference (see the sketch below)
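
A minimal sketch of post-training dynamic quantization using PyTorch's eager-mode API; the toy model and the choice of quantizing only nn.Linear layers are illustrative assumptions, not from the sources clipped below.

```python
# Hedged sketch: post-training dynamic quantization of a toy model.
# Weights are converted to int8 ahead of time; activation ranges are
# computed on the fly at runtime (hence "dynamic").
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only the Linear layers
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same output shape, smaller weights, int8 matmuls
```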

Calibration

The section above described how quantization from float32 to int8 works, but one question remains: how is the [a, b] range of float32 values determined? That is where calibration comes into play.

Calibration is the step during quantization where the float32 ranges are computed. For weights it is quite easy since the actual range is known at quantization-time. But it is less clear for activations, and different approaches exist:

  1. Post training Dynamic Quantization: the range for each activation is computed on the fly at runtime. While this gives great results without too much work, it can be a bit slower than static quantization because of the overhead introduced by computing the range each time. It is also not an option on certain hardware.

  2. Post training Static Quantization: the range for each activation is computed in advance at quantization-time, typically by passing representative data through the model and recording the activation values. In practice, the steps are:

    1. Observers are put on activations to record their values.
    2. A certain number of forward passes on a calibration dataset is done (around 200 examples is enough).
    3. The ranges for each computation are computed according to some calibration technique.
  3. Quantization Aware Training (QAT): the range for each activation is computed at training-time, following the same idea as post training static quantization. But “fake quantize” operators are used instead of observers: they record values just as observers do, but they also simulate the error induced by quantization to let the model adapt to it.

For both post training static quantization and Quantization Aware Training (QAT), it is necessary to define calibration techniques; the most common are:

  • Min-max: the computed range is [min observed value, max observed value], this works well with weights.
  • Moving average min-max: the computed range is [moving average min observed value, moving average max observed value], this works well with activations.
  • Histogram: records a histogram of values along with min and max values, then chooses according to some criterion.
    • Entropy: the range is computed as the one minimizing the error between the full-precision and the quantized data.
    • Mean Square Error: the range is computed as the one minimizing the mean square error between the full-precision and the quantized data.
    • Percentile: the range is computed using a given percentile value p on the observed values. The idea is to try to have p% of the observed values in the computed range. While this is possible when doing affine quantization, it is not always possible to exactly match that when doing symmetric quantization. You can check how it is done in ONNX Runtime for more details.

From <https://huggingface.co/docs/optimum/concept_guides/quantization#calibration>
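
A minimal sketch of the observer-based flow above (insert observers, run calibration passes, compute ranges) using PyTorch's eager-mode static quantization; the toy model, observer choices and random calibration data are assumptions for illustration.

```python
# Hedged sketch: post-training static quantization with observer calibration.
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, QConfig, prepare, convert,
    MinMaxObserver, MovingAverageMinMaxObserver,
)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # float32 -> int8 at the model input
        self.fc = nn.Linear(64, 32)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # int8 -> float32 at the model output

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

model = TinyNet().eval()

# Calibration techniques per tensor kind, mirroring the list above:
# min-max for weights, moving-average min-max for activations.
model.qconfig = QConfig(
    activation=MovingAverageMinMaxObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric
    ),
)

prepared = prepare(model)             # 1. observers are put on activations
for _ in range(200):                  # 2. forward passes on a calibration set
    prepared(torch.randn(8, 64))      #    (random data stands in for real examples)
quantized = convert(prepared)         # 3. ranges computed, float32 ops -> int8 ops
```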

Practical steps to follow to quantize a model to int8

To effectively quantize a model to int8, the steps to follow are:

  1. Choose which operators to quantize. Good operators to quantize are the ones dominating in terms of computation time, for instance linear projections and matrix multiplications.
  2. Try post-training Dynamic Quantization; if it is fast enough, stop here, otherwise continue to step 3.
  3. Try post-training Static Quantization, which can be faster than dynamic quantization but often comes with a drop in accuracy. Applying observers to your model in the places you want to quantize implies defining which quantization scheme to use.
  4. Perform calibration.
  5. Convert the model to its quantized form: the observers are removed and the float32 operators are converted to their int8 counterparts.
  6. Evaluate the quantized model: is the accuracy good enough? If yes, stop here, otherwise start again at step 3 but with Quantization Aware Training (QAT) this time.

From <https://huggingface.co/docs/optimum/concept_guides/quantization#pratical-steps-to-follow-to-quantize-a-model-to-int8>
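
If step 6 sends you back to Quantization Aware Training, a minimal sketch using PyTorch's eager-mode QAT API could look like this; it reuses the TinyNet toy model from the static sketch above, and the training loop is a placeholder.

```python
# Hedged sketch: Quantization Aware Training with fake-quantize modules.
import torch
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

model = TinyNet().train()                        # toy model from the sketch above
model.qconfig = get_default_qat_qconfig("fbgemm")

qat_model = prepare_qat(model)                   # fake-quantize ops record ranges
optimizer = torch.optim.SGD(qat_model.parameters(), lr=1e-3)
for _ in range(100):                             # fine-tune with simulated quantization error
    loss = qat_model(torch.randn(8, 64)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

quantized = convert(qat_model.eval())            # final int8 model
```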

Practical Quantization in PyTorch | PyTorch

Quantization - Neural Network Distiller

Note:
For any of the methods below that require quantization-aware training, please see here for details on how to invoke it using Distiller’s scheduling mechanism.

Range-Based Linear Quantization

Let’s break down the terminology we use here:

  • Linear: Means a float value is quantized by multiplying with a numeric constant (the scale factor).
  • Range-Based: Means that in order to calculate the scale factor, we look at the actual range of the tensor’s values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor’s range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call clipping-based, as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).

Asymmetric vs. Symmetric

In this method we can use two modes - asymmetric and symmetric.

Asymmetric Mode

In asymmetric mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a zero-point (also called quantization bias, or offset) in addition to the scale factor.

Let us denote the original floating-point tensor by $x_f$, the quantized tensor by $x_q$, the scale factor by $q_x$, the zero-point by $zp_x$ and the number of bits used for quantization by $n$. Then, we get:

$$x_q = round\left((x_f - \min x_f)\underbrace{\frac{2^n - 1}{\max x_f - \min x_f}}_{q_x}\right) = round(q_x x_f - \underbrace{\min x_f\, q_x}_{zp_x}) = round(q_x x_f - zp_x)$$

In practice, we actually use $zp_x = round(\min x_f\, q_x)$. This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively “nudge” the min/max values in the float range a little bit, in order to gain this exact quantization of zero.

Note that in the derivation above we use unsigned integer to represent the quantized range. That is, $x_q \in [0, 2^n - 1]$. One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting $2^{n-1}$.
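
A small numeric sketch of the asymmetric formulas above, using an unsigned n-bit range $[0, 2^n - 1]$; the tensor values are arbitrary illustrations.

```python
# Hedged sketch: asymmetric (affine) quantize / dequantize of a tensor.
import torch

def asymmetric_quantize(x_f: torch.Tensor, n: int = 8):
    min_x, max_x = x_f.min(), x_f.max()
    q_x = (2**n - 1) / (max_x - min_x)        # scale factor
    zp_x = torch.round(min_x * q_x)           # rounded zero-point: float 0 maps to an integer
    x_q = torch.clamp(torch.round(q_x * x_f - zp_x), 0, 2**n - 1)
    return x_q, q_x, zp_x

def asymmetric_dequantize(x_q, q_x, zp_x):
    return (x_q + zp_x) / q_x

x_f = torch.tensor([-1.8, -0.2, 0.0, 0.7, 2.1])
x_q, q_x, zp_x = asymmetric_quantize(x_f)
print(x_q, asymmetric_dequantize(x_q, q_x, zp_x))  # reconstruction is approximately the original
```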

Let’s see how a convolution or fully-connected (FC) layer is quantized in asymmetric mode (we denote input, output, weights and bias with $x$, $y$, $w$ and $b$ respectively):

$$y_f = \sum x_f w_f + b_f = \sum \frac{x_q + zp_x}{q_x} \cdot \frac{w_q + zp_w}{q_w} + \frac{b_q + zp_b}{q_b} = \frac{1}{q_x q_w}\left(\sum (x_q + zp_x)(w_q + zp_w) + \frac{q_x q_w}{q_b}(b_q + zp_b)\right)$$

Therefore:

$$y_q = round(q_y y_f) = round\left(\frac{q_y}{q_x q_w}\left(\sum (x_q + zp_x)(w_q + zp_w) + \frac{q_x q_w}{q_b}(b_q + zp_b)\right)\right)$$

Notes:

  • We can see that the bias has to be re-scaled to match the scale of the summation.
  • In a proper integer-only HW pipeline, we would like our main accumulation term to simply be $\sum x_q w_q$. In order to achieve this, one needs to further develop the expression we derived above. For further details please refer to the gemmlowp documentation
Symmetric Mode

In symmetric mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don’t use a zero-point. So, the floating-point range we’re effectively quantizing is symmetric with respect to zero, and so is the quantized range.

There’s a nuance in the symmetric case with regards to the quantized range. Assuming $n$ bits are used for quantization, we can use either a “full” or “restricted” quantized range:

  • Quantized range: full $[-2^{n-1}, 2^{n-1} - 1]$ vs. restricted $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  • 8-bit example: full $[-128, 127]$ vs. restricted $[-127, 127]$
  • Scale factor: full $q_x = \frac{2^n - 1}{2 \max|x_f|}$ vs. restricted $q_x = \frac{2^{n-1} - 1}{\max|x_f|}$

The restricted range is less accurate on-paper, and is usually used when specific HW considerations require it. Implementations of quantization “in the wild” that use a full range include PyTorch’s native quantization (from v1.3 onwards) and ONNX. Implementations that use a restricted range include TensorFlow, NVIDIA TensorRT and Intel DNNL (aka MKL-DNN). Distiller can emulate both modes.

Using the same notations as above, we get (regardless of full/restricted range):

$$x_q = round(q_x x_f)$$

Again, let’s see how a convolution or fully-connected (FC) layer is quantized, this time in symmetric mode:

$$y_f = \sum x_f w_f + b_f = \sum \frac{x_q}{q_x} \cdot \frac{w_q}{q_w} + \frac{b_q}{q_b} = \frac{1}{q_x q_w}\left(\sum x_q w_q + \frac{q_x q_w}{q_b}\, b_q\right)$$

Therefore:

$$y_q = round(q_y y_f) = round\left(\frac{q_y}{q_x q_w}\left(\sum x_q w_q + \frac{q_x q_w}{q_b}\, b_q\right)\right)$$
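
A small sketch of symmetric quantization with the full- and restricted-range scale factors from the table above (n = 8); the example values are arbitrary.

```python
# Hedged sketch: symmetric quantization, full vs. restricted quantized range.
import torch

def symmetric_quantize(x_f: torch.Tensor, n: int = 8, restricted: bool = False):
    sat = x_f.abs().max()
    if restricted:
        q_x = (2**(n - 1) - 1) / sat           # range [-(2^(n-1)-1), 2^(n-1)-1]
        lo, hi = -(2**(n - 1) - 1), 2**(n - 1) - 1
    else:
        q_x = (2**n - 1) / (2 * sat)           # range [-2^(n-1), 2^(n-1)-1]
        lo, hi = -2**(n - 1), 2**(n - 1) - 1
    x_q = torch.clamp(torch.round(q_x * x_f), lo, hi)
    return x_q, q_x

x_f = torch.tensor([-3.0, -0.5, 0.0, 1.2, 2.4])
print(symmetric_quantize(x_f, restricted=False))
print(symmetric_quantize(x_f, restricted=True))
```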

Comparing the Two Modes

The main trade-off between these two modes is simplicity vs. utilization of the quantized range.

  • When using asymmetric quantization, the quantized range is fully utilized. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. Using symmetric mode, if the float range is biased towards one side, this could result in a quantized range where significant dynamic range is dedicated to values that we’ll never see. The most extreme example of this is after ReLU, where the entire tensor is positive. Quantizing it in symmetric mode means we’re effectively losing 1 bit.
  • On the other hand, if we look at the derivations for convolution / FC layers above, we can see that the actual implementation of symmetric mode is much simpler. In asymmetric mode, the zero-points require additional logic in HW. The cost of this extra logic in terms of latency and/or power and/or area will of course depend on the exact implementation.

Questions

LLM.int8() - 8 bit multiplication

  • solution to outliers causing problems in quantization
    • Outlier features are extreme values (negative or positive) that appear in all transformer layers when the model reaches a certain scale (>6.7B parameters).
    • this is an issue since a single outlier can reduce the precision for all other values
    • discarding outlier features is not an option since it would degrade the model’s performance
  • relies on a vector-wise (absmax) quantization scheme
  • introduces mixed-precision quantization
    • outlier features are processed in a FP16 format to retain their precision
    • other values are processed in an INT8 format
    • since outliers represent only about 0.1% of values, this still reduces the memory footprint of the LLM by almost 2x
    • additional cost in terms of computation: LLM.int8() is roughly 20% slower for large models
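
A minimal sketch of running a model with LLM.int8() through Hugging Face Transformers and bitsandbytes (requires a GPU with bitsandbytes and accelerate installed); the model id and threshold value are placeholder assumptions.

```python
# Hedged sketch: loading a causal LM with LLM.int8() mixed-precision quantization.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"          # placeholder; any causal LM works
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,                  # vector-wise int8 with FP16 outlier handling
    llm_int8_threshold=6.0,             # activation magnitude above which a feature is treated as an outlier
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```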

Weight Quantization, vs Activation Quantization

  • weights: the trained parameters, stored after training
  • activations: the features computed during the forward pass
  • Weight-only is the most common type of quantization approach
    • notation is like W8A16 → i.e. weights in 8-bit, activations left at 16-bit
    • weight quantization would be done offline, prior to inference
    • then during inference, the weights are dequantized on the fly to FP16 for the multiplication, producing FP16 activations, while the weights themselves stay in memory as W8 (see the sketch after this list)
  • activation quantization would be done by the runtime engine at execution time
    • e.g. pure torch supports it easily, whereas llama.cpp does not (activations cannot be quantized there)
    • but the kernels must also support the activation datatype, otherwise it is not supported
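
A toy sketch of the W8A16 idea described above: int8 weights plus a per-channel scale stay in memory and are dequantized just before the matmul, while activations keep the higher-precision dtype. This is a conceptual illustration under assumed names and shapes, not any specific runtime's kernel.

```python
# Hedged sketch: weight-only (W8A16-style) quantization with on-the-fly dequantization.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# real W8A16 kernels use FP16 on GPU; fall back to FP32 on CPU for portability
act_dtype = torch.float16 if device == "cuda" else torch.float32

def quantize_weight_per_channel(w, n_bits=8):
    # offline step: symmetric per-output-channel int8 weights + scales
    scale = w.abs().amax(dim=1, keepdim=True) / (2 ** (n_bits - 1) - 1)
    w_q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return w_q, scale.to(act_dtype)

def linear_weight_only(x, w_q, scale):
    # runtime step: dequantize on the fly, activations stay in act_dtype
    w = w_q.to(act_dtype) * scale
    return x @ w.t()

w_fp = torch.randn(32, 64, dtype=act_dtype, device=device)
w_q, s = quantize_weight_per_channel(w_fp)            # stored in memory as int8 + scales
x = torch.randn(4, 64, dtype=act_dtype, device=device)
print(linear_weight_only(x, w_q, s).shape)            # (4, 32), activations untouched
```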

Naive quantization

  • Round-to-Nearest (RTN)

Activation-aware Weight Quantization (AWQ)

  • Found that weights are not all equally important
    • e.g. quantizing all FP16 weights to INTx makes the Perplexity Metric (PPL) increase as performance drops
    • BUT, when restoring 1% of the INTx weights back to FP16 (i.e. mixed precision), the PPL drops again → i.e. keeping just 1% of salient weights (from some channels) in FP16 already improves performance a lot
  • Question is how to select the important channels?
  • Answer: select them by observing the activation magnitudes
    • the input channels with the largest activations indicate which corresponding weight channels are the most important
    • since the corresponding output activations after that layer would then likely be the largest
  • Next question: how to remove the mixed precision?
    • scale the salient weight channel up (e.g. multiply it by 2) and divide the corresponding activation channel by the same factor, so the output is unchanged but no weights need to stay in FP16 (see the sketch below)
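
A hedged toy sketch of the scaling idea in the bullets above. Real AWQ searches for per-channel scale factors and uses group-wise quantization; here a single channel is scaled by 2 and quantization is naive per-tensor RTN, purely to illustrate the mechanism.

```python
# Hedged toy sketch: scale a salient weight channel up and the matching activation
# channel down, so W @ X is unchanged in float but the salient weights lose less
# precision after quantization.
import torch

def rtn_quantize(w, n_bits=4):
    # naive round-to-nearest with a single per-tensor scale (illustration only)
    scale = w.abs().amax() / (2 ** (n_bits - 1) - 1)
    lim = 2 ** (n_bits - 1) - 1
    return torch.round(w / scale).clamp(-lim, lim) * scale

torch.manual_seed(0)
w = torch.randn(16, 8)                         # 16 output channels, 8 input channels
x = torch.randn(8, 4)                          # 8 input channels, 4 tokens

salient = x.abs().mean(dim=1).argmax()         # input channel with the largest activations
s = torch.ones(8)
s[salient] = 2.0                               # the "multiply by 2" from the bullet above

w_scaled = w * s                               # scale the salient weight column up
x_scaled = x / s.unsqueeze(1)                  # divide the activation channel to compensate

err_plain = (w @ x - rtn_quantize(w) @ x).abs().mean()
err_scaled = (w @ x - rtn_quantize(w_scaled) @ x_scaled).abs().mean()
print(err_plain.item(), err_scaled.item())     # scaled version typically shows lower error
```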

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks