EfficientDL - 5. Quantization (Part I)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Quantization

Can you quantize a pruned model?

  • yes, but do the pruning first, then quantization

Pruning reduces the number of weights; quantization reduces the number of bits per weight. The two techniques are orthogonal and can be combined.

Very useful for deployment on edge devices.

Numerical Representations

Integer representations

Fixed-Point Number
  • represents non-integer values, but without a “floating” binary point: a fixed number of bits is reserved for the fractional part (see the sketch below)
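A minimal sketch of a fixed-point format, assuming an 8-bit Q4.4 layout (4 integer bits, 4 fractional bits); the layout and helper names are illustrative, not from the lecture:

```python
# Q4.4 fixed point: value = signed_int8 / 2**4 (4 fractional bits).
FRAC_BITS = 4
SCALE = 1 << FRAC_BITS  # 16

def to_fixed(x: float) -> int:
    """Encode a real number as a signed 8-bit fixed-point integer."""
    q = round(x * SCALE)
    return max(-128, min(127, q))  # saturate to the int8 range

def from_fixed(q: int) -> float:
    """Decode back to a real number."""
    return q / SCALE

print(to_fixed(3.14159), from_fixed(to_fixed(3.14159)))  # 50 3.125
```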
Floating-Point Number (IEEE 754 Single Precision 32-bit float)

  • 1 sign bit, 8 exponent bits, 23 fraction (mantissa) bits
  • for normal numbers: value = (-1)^sign × (1 + Fraction) × 2^(Exponent - 127)
  • Exponent - allows for a much larger dynamic range (the span between the largest and smallest magnitudes that can be represented)
Representing zero

  • when the exponent field is all zeros the formula changes! these are the subnormal numbers
    • instead of (1 + Fraction), use Fraction only, with the exponent fixed at 2^(-126)
    • this is how zero (Fraction = 0) and values smaller than the smallest normal number are represented
What is the smallest possible number?

  • smallest positive subnormal: 2^(-23) × 2^(-126) = 2^(-149) ≈ 1.4 × 10^(-45)
  • smallest positive normal: 2^(-126) ≈ 1.2 × 10^(-38)

What is the largest possible number?

  • (2 - 2^(-23)) × 2^127 ≈ 3.4 × 10^38

Range

  • fp32 covers roughly ±3.4 × 10^38, with subnormals reaching down to about 1.4 × 10^(-45) in magnitude (verified below)
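These limits can be checked directly with NumPy's float32 metadata (the `smallest_subnormal` attribute needs NumPy ≥ 1.22):

```python
import numpy as np

fi = np.finfo(np.float32)
print(fi.max)                  # ~3.4028235e+38 -> largest finite fp32
print(fi.tiny)                 # ~1.1754944e-38 -> smallest positive normal
print(fi.smallest_subnormal)   # ~1.4e-45       -> smallest positive subnormal
print(fi.eps)                  # ~1.1920929e-07 -> precision near 1.0 (2**-23)
```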

Floating-Point Number (other 16bit, 8bit types)
  • Number of exponent bits important for dynamic range
  • Number of fraction bits important for precision
  • Google Brain Float (BF16)
    • keeps the 8 exponent bits (and hence the dynamic range) of fp32, but uses only 16 bits, saving half the storage
    • in practice, LLM training usually converges more reliably in BF16 than in FP16
    • quite widely used now
  • NVIDIA FP8 types: in training, a higher dynamic range is preferred (E5M2); in inference, higher precision is preferred (E4M3)
  • the H100 GPU supports these NVIDIA FP8 types (see the sketch below)
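A small sketch comparing range and precision of these formats with `torch.finfo`; the FP8 dtypes (`torch.float8_e4m3fn`, `torch.float8_e5m2`) assume a recent PyTorch build, so the lookup is guarded:

```python
import torch

# Compare dynamic range (max) and precision (eps) across formats.
for name, dtype in [("fp32", torch.float32),
                    ("fp16", torch.float16),
                    ("bf16", torch.bfloat16)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max:.3e}  eps={info.eps:.3e}")

# FP8 dtypes only exist in newer PyTorch versions; skip them if absent.
for name in ["float8_e4m3fn", "float8_e5m2"]:
    dtype = getattr(torch, name, None)
    if dtype is not None:
        info = torch.finfo(dtype)
        print(f"{name}: max={info.max:.3e}  eps={info.eps:.3e}")
```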
Floating-Point Number (4bit - INT4 and FP4)

  • E1M2 is not used often; in that case just use INT4
  • E2M1 is the FP4 variant used instead
  • E3M0 is a completely logarithmic representation (no mantissa bits), also not used often
  • as of 2022, Qualcomm's Snapdragon 8 Gen 2 chip supports 4-bit quantization!

Quantization Definition

  • process of constraining an input from a continuous or otherwise large set of values to a discrete set (p.397)

Naive Quantization Approach

K-Means-based Quantization Approach

  • use only 4 choices from the palette (the “floating-point codebook”), i.e. 2-bit indices
  • each weight in the 4x4 example matrix must be replaced by one of these 4 palette values
  • this idea has been reborn in the context of real-time LLM inference
    • weight memory is the bottleneck, so store weights as low-bit integer indices
    • compute is not the bottleneck, so we can afford to do the arithmetic in fp16
Storing the weights

  • store the index of the nearest centroid for each weight (integer weights)
  • the palette is the set of centroids (the floating-point codebook), as sketched below
    • the centroid values form an arbitrary mapping
    • they are not any form of linearly spaced mapping
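A minimal sketch of K-Means weight quantization with scikit-learn; the helper name `kmeans_quantize` and the 2-bit setting are illustrative, not from the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights: np.ndarray, n_bits: int = 2):
    """Cluster the weights into 2**n_bits centroids.

    Returns (indices, codebook): low-bit integer indices and the
    floating-point centroid values (the "codebook").
    """
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=2**n_bits, n_init=10).fit(flat)
    indices = km.labels_.reshape(weights.shape).astype(np.uint8)
    codebook = km.cluster_centers_.reshape(-1).astype(np.float32)
    return indices, codebook

# 4x4 example weight matrix, quantized to a 4-entry palette (2-bit indices).
W = np.random.randn(4, 4).astype(np.float32)
idx, codebook = kmeans_quantize(W, n_bits=2)
W_hat = codebook[idx]  # decoded (approximate) weights
```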
Finetuning the weights

  • accumulate (sum) the gradients of all weights that share the same centroid index
  • then update each centroid value with its accumulated gradient
  • this is 1 iteration of finetuning on top of the k-means-quantized weights (see the sketch below)
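A sketch of one such finetuning step; `update_codebook` is a hypothetical helper, assuming `idx` and `codebook` from the snippet above and a full-precision gradient `grad_W` with the same shape as the weights:

```python
import numpy as np

def update_codebook(codebook: np.ndarray, idx: np.ndarray,
                    grad_W: np.ndarray, lr: float = 1e-2) -> np.ndarray:
    """One finetuning step: group gradients by centroid index and apply them."""
    new_codebook = codebook.copy()
    for k in range(len(codebook)):
        mask = (idx == k)
        if mask.any():
            # Sum the gradients of all weights assigned to centroid k.
            new_codebook[k] -= lr * grad_W[mask].sum()
    return new_codebook
```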
Decoding the weights

  • during runtime, the weights are decoded via a codebook lookup
  • this saves memory footprint: only the 2-bit indices need to be fetched and decoded on the fly, instead of 32-bit weights
  • it only saves storage; compute still uses full-precision arithmetic
  • when the workload is memory-bound this is still very helpful!
    • e.g. running LLMs like Llama 2
    • to generate one token with a 7B model in fp16, roughly 14 GB of weights (7B parameters × 2 bytes) must be read from memory, since every weight is touched for every token
    • hence it is crucial to shrink the weight storage (see the sketch below)
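A sketch of on-the-fly decoding and the memory-footprint arithmetic (the 2-bit figure ignores the tiny codebook itself):

```python
import numpy as np

# Decoding is just a table lookup: indices -> centroid values.
def decode(idx: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    return codebook[idx]

# Weight-read comparison for a 7B-parameter model, per generated token.
n_params = 7e9
fp16_bytes = n_params * 2        # ~14 GB of fp16 weights
int2_bytes = n_params * 2 / 8    # ~1.75 GB of 2-bit indices
print(f"fp16: {fp16_bytes/1e9:.1f} GB, 2-bit indices: {int2_bytes/1e9:.2f} GB")
```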
Results of pruning + quantization using K-Means

  • 4 bits are enough for conv layers, based on the Deep Compression paper (2016)
  • for FC layers, 2 bits are enough
Huffman Coding
  • previously we used the same number of bits for every weight
    • how can we use a different number of bits per weight? assign shorter codes to more frequent values
  • can squeeze out roughly the last 1% of compression
  • but decoding has a cost; it is not easy to implement in practice (implementation complexity)
Summary of Deep Compression paper

  • pruning and quantization are widely used in industry
  • only Huffman encoding is not used, due to its implementation complexity: it buys the last few % of compression but adds decoding cost
  • the pipeline is still useful even when you start with an already-small model, e.g. SqueezeNet

Linear Quantization Approach

  • now the palette / “codebook” is equally spaced, unlike the k-means-based approach
    • the levels are no longer arbitrary numbers
  • less flexibility than the k-means-based approach
    • but much easier to decode, thanks to the linear mapping
Defining this representation

  • r = S (q - Z)
    • r: the real (floating-point) value
    • q: the quantized integer
    • S: the scaling factor (floating-point)
    • Z: the zero point (integer)
How to get the zero point and scale?
Zero Point
  • acts as a “bias” / offset
  • the integer Z is the quantized value that maps back to the real value zero: q = Z gives r = 0
Scaling factor

  • there are 2 unknowns, S and Z
    • 2 equations to solve them, from the extreme points: r_max = S (q_max - Z) and r_min = S (q_min - Z)
    • solving: S = (r_max - r_min) / (q_max - q_min), and Z = round(q_min - r_min / S)
    • how to calibrate r_min and r_max? See Part II
    • we already know the number of bits to use, hence q_min and q_max are known (e.g. -128 and 127 for signed 8-bit)
Example calculation per above

  • Z must be rounded, since the zero point has to be an integer
  • q_min and q_max come from the table for the chosen bit width
  • r_min and r_max come from the weight matrix itself (see the sketch below)
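A small worked example, assuming 2-bit signed quantization of a 4x4 weight matrix; the numbers and the helper name `linear_quantize` are illustrative:

```python
import numpy as np

def linear_quantize(W: np.ndarray, n_bits: int = 2):
    """Asymmetric linear quantization: r ≈ S * (q - Z)."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1   # -2 and 1 for 2-bit
    r_min, r_max = W.min(), W.max()
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))            # zero point must be an integer
    q = np.clip(np.round(W / S + Z), q_min, q_max).astype(np.int8)
    return q, S, Z

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)
q, S, Z = linear_quantize(W, n_bits=2)
W_hat = S * (q.astype(np.float32) - Z)           # dequantized approximation
```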
Linear Quantized Arithmetic
  • how can we use integer arithmetic for the following layers?
Linear / Fully-Connected (FC) Layer (without bias) Matrix Multiplication

  • Plug the integer representation into Y = W X (per the above definition): S_Y (q_Y - Z_Y) = S_W S_X (q_W - Z_W)(q_X - Z_X)
  • expanded (the 2nd line of the formula): q_Y = Z_Y + (S_W S_X / S_Y) (q_W q_X - Z_W q_X - Z_X q_W + Z_W Z_X)
    • precompute: the terms Z_X q_W and Z_W Z_X involve only the weights and zero points, so they are known ahead of time
  • Scaling factor S_W S_X / S_Y: empirically it lies in (0, 1), so it can be written as 2^(-n) * M0 with a fixed-point multiplier M0
    • implemented by an integer multiplication followed by a bit shift
  • the weight distribution is usually centered around zero, so it is usually symmetric; we can force Z_W to zero, which removes the Z_W q_X and Z_W Z_X terms (see the sketch after this list)
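A NumPy sketch of this integer pipeline for the no-bias case; `quant_linear_forward` is a hypothetical helper, assuming symmetric weights (Z_W = 0) and already-calibrated quantization parameters, with the multiplier/shift trick shown as a plain float multiply for clarity:

```python
import numpy as np

def quant_linear_forward(q_W, q_X, Z_X, Z_Y, S_W, S_X, S_Y):
    """Integer FC layer (no bias), assuming symmetric weights (Z_W = 0).

    q_W: int8 weights [out, in]; q_X: int8 inputs [in, batch].
    Returns q_Y: int8 outputs [out, batch] with S_Y * (q_Y - Z_Y) ≈ W @ X.
    """
    # Heavy lifting: int8 x int8 matmul accumulated in int32.
    acc = q_W.astype(np.int32) @ q_X.astype(np.int32)
    # Precomputable correction: -Z_X * q_W summed over the input dimension.
    acc -= Z_X * q_W.astype(np.int32).sum(axis=1, keepdims=True)
    # Requantize: scale by S_W*S_X/S_Y (in HW: int multiply + bit shift), add Z_Y.
    scale = (S_W * S_X) / S_Y
    q_Y = np.round(acc * scale) + Z_Y
    return np.clip(q_Y, -128, 127).astype(np.int8)
```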
Linear / Fully-Connected (FC) Layer (with bias) Matrix Multiplication + Addition of bias

  • the bias is quantized to int32 with scale S_W S_X and zero point 0, so it folds directly into the accumulator: q_Y = Z_Y + (S_W S_X / S_Y)(q_W q_X + q_bias - Z_X q_W)
  • the heavy lifting is done in the q_W q_X integer matrix multiplication
  • 32-bit integer addition is used for the bias and the accumulator to prevent overflow of the int8 products (see the sketch below)
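A sketch of the bias handling on top of the previous function; `quantize_bias` is a hypothetical helper, assuming the int32 bias uses scale S_W * S_X and zero point 0 as described above:

```python
import numpy as np

def quantize_bias(b: np.ndarray, S_W: float, S_X: float) -> np.ndarray:
    # int32 bias with scale S_W * S_X and zero point 0 (prevents overflow).
    return np.round(b / (S_W * S_X)).astype(np.int32)

# Inside the integer FC forward pass, the quantized bias is simply added
# to the int32 accumulator before requantization:
#   acc += q_bias[:, None]
```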
Convolution Layer

  • the same derivation applies, with the convolution (in place of the matrix multiplication) computed in integer arithmetic and the same rescaling by S_W S_X / S_Y afterwards
Summary of results from linear quantization

  • doesn't lose a lot of accuracy (float vs 8-bit)
  • but latency is reduced!