Static Quantization


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, TinyML


From *PyTorch Static Quantization* (Lei Mao's Log Book): static quantization quantizes both the weights and the activations of the model, and allows activations to be fused into preceding layers where possible. Unlike dynamic quantization, where the scales and zero points are collected during inference, the scales and zero points for static quantization are determined prior to inference using a representative dataset. Static quantization is therefore theoretically faster than dynamic quantization, while model size and memory bandwidth consumption stay the same, which makes statically quantized models more favorable for inference than dynamically quantized ones.
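
To make "determined prior to inference" concrete, here is a minimal sketch (not from the source) of how affine quantization parameters can be derived from a min/max range collected during calibration; the function names and the example range are illustrative:

```python
import numpy as np

def compute_qparams(x_min: float, x_max: float, q_min: int = 0, q_max: int = 255):
    """Derive scale and zero point from a calibrated range.
    With static quantization these values are fixed before inference begins."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # the range must cover 0.0
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(np.clip(round(q_min - x_min / scale), q_min, q_max))
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=0, q_max=255):
    return np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Suppose a calibration pass observed activations in [-1.5, 4.2] (made-up numbers)
scale, zp = compute_qparams(-1.5, 4.2)
x = np.array([-1.2, 0.0, 3.9], dtype=np.float32)
print(dequantize(quantize(x, scale, zp), scale, zp))  # approximately the input
```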

Post-training static quantization: the range for each activation is computed in advance at quantization time, typically by passing representative data through the model and recording the activation values. In practice, the steps are:

  1. Observers are put on activations to record their values.
  2. A number of forward passes are run on a calibration dataset (around 200 examples is enough).
  3. The ranges for each computation are computed according to some calibration technique.
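
These steps map directly onto PyTorch's eager-mode post-training static quantization API. Below is a sketch under assumptions: SmallNet is a made-up model, and random tensors stand in for a real calibration dataset:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (QuantStub, DeQuantStub, fuse_modules,
                                   get_default_qconfig, prepare, convert)

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks the fp32 -> int8 boundary
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # marks the int8 -> fp32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)

model = SmallNet().eval()
model = fuse_modules(model, [["conv", "bn", "relu"]])  # fuse into preceding layers
model.qconfig = get_default_qconfig("fbgemm")          # x86 backend; "qnnpack" on ARM
prepared = prepare(model)                              # step 1: insert observers

with torch.no_grad():                                  # step 2: calibration passes
    for _ in range(200):                               # ~200 representative examples
        prepared(torch.randn(1, 3, 32, 32))            # stand-in for real data

quantized = convert(prepared)                          # step 3: compute ranges, swap to int8 ops
quantized(torch.randn(1, 3, 32, 32))                   # inference now uses fixed qparams
```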

It is necessary to define calibration techniques; the most common are listed below (a PyTorch observer sketch follows the list):

  • Min-max: the computed range is [min observed value, max observed value]; this works well with weights.
  • Moving average min-max: the computed range is [moving average min observed value, moving average max observed value]; this works well with activations.
  • Histogram: records a histogram of values along with the min and max values, then chooses the range according to some criterion:
    • Entropy: the range is computed as the one minimizing the error between the full-precision and the quantized data.
    • Mean square error: the range is computed as the one minimizing the mean square error between the full-precision and the quantized data.
    • Percentile: the range is computed using a given percentile value p on the observed values. The idea is to have p% of the observed values inside the computed range. While this is possible when doing affine quantization, it is not always possible to match exactly when doing symmetric quantization. You can check how it is done in ONNX Runtime for more details.
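
The first three strategies correspond to concrete observer classes in torch.ao.quantization.observer. The sketch below feeds the same synthetic data to each and compares the resulting quantization parameters; the data distribution and averaging constant are illustrative (PyTorch's HistogramObserver searches for the range minimizing quantization error, akin to the MSE criterion, while entropy/percentile calibrators appear in toolkits such as ONNX Runtime and TensorRT):

```python
import torch
from torch.ao.quantization.observer import (MinMaxObserver,
                                            MovingAverageMinMaxObserver,
                                            HistogramObserver)

observers = {
    "min-max": MinMaxObserver(),                     # running min/max over all batches
    "moving avg": MovingAverageMinMaxObserver(averaging_constant=0.01),
    "histogram": HistogramObserver(),                # error-minimizing range search
}

for _ in range(20):                                  # simulate calibration batches
    x = torch.randn(1024) * 2.0
    for obs in observers.values():
        obs(x)                                       # observers record stats on forward

for name, obs in observers.items():
    scale, zero_point = obs.calculate_qparams()
    print(f"{name}: scale={scale.item():.4f}, zero_point={zero_point.item()}")
```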

From <https://huggingface.co/docs/optimum/concept_guides/quantization#calibration>