EfficientDL - 5. Quantization (Part I)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Quantization

Can you quantize a pruned model?

  • yes, but do the pruning first, then quantization

Pruning reduces the number of weights; quantization reduces the number of bits per weight. The two techniques are orthogonal and can be combined.

Very useful for deployment on edge devices.

Numerical Representations

Integer representations

Fixed-Point Number
  • represents non-integer values, but without a “floating” binary point: a fixed number of bits is reserved for the fractional part (see the sketch below)
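A minimal sketch of a fixed-point format, assuming an 8-bit Q4.4 layout (4 integer bits, 4 fractional bits); the layout and helper names are illustrative, not from the lecture:

```python
# Q4.4 fixed point: value = signed_int8 / 2**4 (4 fractional bits).
FRAC_BITS = 4
SCALE = 1 << FRAC_BITS  # 16

def to_fixed(x: float) -> int:
    """Encode a real number as a signed 8-bit fixed-point integer."""
    q = round(x * SCALE)
    return max(-128, min(127, q))  # saturate to the int8 range

def from_fixed(q: int) -> float:
    """Decode back to a real number."""
    return q / SCALE

print(to_fixed(3.14159), from_fixed(to_fixed(3.14159)))  # 50 3.125
```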
Floating-Point Number (IEEE 754 Single Precision 32-bit float)

  • 1 sign bit, 8 exponent bits, 23 fraction (mantissa) bits
  • for normal numbers: value = (-1)^sign × (1 + Fraction) × 2^(Exponent - 127)
  • Exponent - allows for a much larger dynamic range (the span between the largest and smallest magnitudes that can be represented)
Representing zero

  • when the exponent field is all zeros the formula changes! these are the subnormal numbers
    • instead of (1 + Fraction), use Fraction only, with the exponent fixed at 2^(-126)
    • this is how zero (Fraction = 0) and values smaller than the smallest normal number are represented
What is the smallest possible number?

  • smallest positive subnormal: 2^(-23) × 2^(-126) = 2^(-149) ≈ 1.4 × 10^(-45)
  • smallest positive normal: 2^(-126) ≈ 1.2 × 10^(-38)

What is the largest possible number?

  • (2 - 2^(-23)) × 2^127 ≈ 3.4 × 10^38

Range

  • fp32 covers roughly ±3.4 × 10^38, with subnormals reaching down to about 1.4 × 10^(-45) in magnitude (verified below)
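These limits can be checked directly with NumPy's float32 metadata (the `smallest_subnormal` attribute needs NumPy ≥ 1.22):

```python
import numpy as np

fi = np.finfo(np.float32)
print(fi.max)                  # ~3.4028235e+38 -> largest finite fp32
print(fi.tiny)                 # ~1.1754944e-38 -> smallest positive normal
print(fi.smallest_subnormal)   # ~1.4e-45       -> smallest positive subnormal
print(fi.eps)                  # ~1.1920929e-07 -> precision near 1.0 (2**-23)
```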

Floating-Point Number (other 16bit, 8bit types)
  • Number of exponent bits important for dynamic range
  • Number of fraction bits important for precision
  • Google Brain Float (BF16)
    • keeps the 8 exponent bits (and hence the dynamic range) of fp32, but uses only 16 bits, saving half the storage
    • in practice, LLM training usually converges more reliably in BF16 than in FP16
    • quite widely used now
  • NVIDIA FP8 types: in training, a higher dynamic range is preferred (E5M2); in inference, higher precision is preferred (E4M3)
  • the H100 GPU supports these NVIDIA FP8 types (see the sketch below)
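A small sketch comparing range and precision of these formats with `torch.finfo`; the FP8 dtypes (`torch.float8_e4m3fn`, `torch.float8_e5m2`) assume a recent PyTorch build, so the lookup is guarded:

```python
import torch

# Compare dynamic range (max) and precision (eps) across formats.
for name, dtype in [("fp32", torch.float32),
                    ("fp16", torch.float16),
                    ("bf16", torch.bfloat16)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max:.3e}  eps={info.eps:.3e}")

# FP8 dtypes only exist in newer PyTorch versions; skip them if absent.
for name in ["float8_e4m3fn", "float8_e5m2"]:
    dtype = getattr(torch, name, None)
    if dtype is not None:
        info = torch.finfo(dtype)
        print(f"{name}: max={info.max:.3e}  eps={info.eps:.3e}")
```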
Floating-Point Number (4bit - INT4 and FP4)

  • E1M2 is not used often; in that case just use INT4
  • E2M1 is the FP4 variant used instead
  • E3M0 is a completely logarithmic representation (no mantissa bits), also not used often
  • as of 2022, Qualcomm's Snapdragon 8 Gen 2 chip supports 4-bit quantization!

Quantization Definition

  • process of constraining an input from a continuous or otherwise large set of values to a discrete set (p.397)

Naive Quantization Approach

K-Means-based Quantization Approach

  • use only 4 choices from the palette (the “floating-point codebook”), i.e. 2-bit indices
  • each weight in the 4x4 example matrix must be replaced by one of these 4 palette values
  • this idea has been reborn in the context of real-time LLM inference
    • weight memory is the bottleneck, so store weights as low-bit integer indices
    • compute is not the bottleneck, so we can afford to do the arithmetic in fp16
Storing the weights

  • store the index of the nearest centroid for each weight (integer weights)
  • the palette is the set of centroids (the floating-point codebook), as sketched below
    • the centroid values form an arbitrary mapping
    • they are not any form of linearly spaced mapping
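A minimal sketch of K-Means weight quantization with scikit-learn; the helper name `kmeans_quantize` and the 2-bit setting are illustrative, not from the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights: np.ndarray, n_bits: int = 2):
    """Cluster the weights into 2**n_bits centroids.

    Returns (indices, codebook): low-bit integer indices and the
    floating-point centroid values (the "codebook").
    """
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=2**n_bits, n_init=10).fit(flat)
    indices = km.labels_.reshape(weights.shape).astype(np.uint8)
    codebook = km.cluster_centers_.reshape(-1).astype(np.float32)
    return indices, codebook

# 4x4 example weight matrix, quantized to a 4-entry palette (2-bit indices).
W = np.random.randn(4, 4).astype(np.float32)
idx, codebook = kmeans_quantize(W, n_bits=2)
W_hat = codebook[idx]  # decoded (approximate) weights
```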
Finetuning the weights

  • accumulate (sum) the gradients of all weights that share the same centroid index
  • then update each centroid value with its accumulated gradient
  • this is 1 iteration of finetuning on top of the k-means-quantized weights (see the sketch below)
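A sketch of one such finetuning step; `update_codebook` is a hypothetical helper, assuming `idx` and `codebook` from the snippet above and a full-precision gradient `grad_W` with the same shape as the weights:

```python
import numpy as np

def update_codebook(codebook: np.ndarray, idx: np.ndarray,
                    grad_W: np.ndarray, lr: float = 1e-2) -> np.ndarray:
    """One finetuning step: group gradients by centroid index and apply them."""
    new_codebook = codebook.copy()
    for k in range(len(codebook)):
        mask = (idx == k)
        if mask.any():
            # Sum the gradients of all weights assigned to centroid k.
            new_codebook[k] -= lr * grad_W[mask].sum()
    return new_codebook
```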
Decoding the weights

  • during runtime, the weights are decoded via a codebook lookup
  • this saves memory footprint: only the 2-bit indices need to be fetched and decoded on the fly, instead of 32-bit weights
  • it only saves storage; compute still uses full-precision arithmetic
  • when the workload is memory-bound this is still very helpful!
    • e.g. running LLMs like Llama 2
    • to generate one token with a 7B model in fp16, roughly 14 GB of weights (7B parameters × 2 bytes) must be read from memory, since every weight is touched for every token
    • hence it is crucial to shrink the weight storage (see the sketch below)
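A sketch of on-the-fly decoding and the memory-footprint arithmetic (the 2-bit figure ignores the tiny codebook itself):

```python
import numpy as np

# Decoding is just a table lookup: indices -> centroid values.
def decode(idx: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    return codebook[idx]

# Weight-read comparison for a 7B-parameter model, per generated token.
n_params = 7e9
fp16_bytes = n_params * 2        # ~14 GB of fp16 weights
int2_bytes = n_params * 2 / 8    # ~1.75 GB of 2-bit indices
print(f"fp16: {fp16_bytes/1e9:.1f} GB, 2-bit indices: {int2_bytes/1e9:.2f} GB")
```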
Results of pruning + quantization using K-Means

  • 4 bits are enough for conv layers, based on the Deep Compression paper (2016)
  • for FC layers, 2 bits are enough
Huffman Coding
  • previously we used the same number of bits for every weight
    • how can we use a different number of bits per weight? assign shorter codes to more frequent values
  • can squeeze out roughly the last 1% of compression
  • but decoding has a cost; it is not easy to implement in practice (implementation complexity)
Summary of Deep Compression paper

  • pruning and quantization are widely used in industry
  • only Huffman encoding is not used, due to its implementation complexity: it buys the last few % of compression but adds decoding cost
  • the pipeline is still useful even when you start with an already-small model, e.g. SqueezeNet

Linear Quantization Approach

  • now the palette / “codebook” is equally spaced, unlike the k-means-based approach
    • the levels are no longer arbitrary numbers
  • less flexibility than the k-means-based approach
    • but much easier to decode, thanks to the linear mapping
Defining this representation

  • r = S (q - Z)
    • r: the real (floating-point) value
    • q: the quantized integer
    • S: the scaling factor (floating-point)
    • Z: the zero point (integer)
How to get the zero point and scale?
Zero Point
  • acts as a “bias” / offset
  • the integer Z is the quantized value that maps back to the real value zero: q = Z gives r = 0
Scaling factor

  • there are 2 unknowns, S and Z
    • 2 equations to solve them, from the extreme points: r_max = S (q_max - Z) and r_min = S (q_min - Z)
    • solving: S = (r_max - r_min) / (q_max - q_min), and Z = round(q_min - r_min / S)
    • how to calibrate r_min and r_max? See Part II
    • we already know the number of bits to use, hence q_min and q_max are known (e.g. -128 and 127 for signed 8-bit)
Example calculation per above

  • Z must be rounded, since the zero point has to be an integer
  • q_min and q_max come from the table for the chosen bit width
  • r_min and r_max come from the weight matrix itself (see the sketch below)
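A small worked example, assuming 2-bit signed quantization of a 4x4 weight matrix; the numbers and the helper name `linear_quantize` are illustrative:

```python
import numpy as np

def linear_quantize(W: np.ndarray, n_bits: int = 2):
    """Asymmetric linear quantization: r ≈ S * (q - Z)."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1   # -2 and 1 for 2-bit
    r_min, r_max = W.min(), W.max()
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))            # zero point must be an integer
    q = np.clip(np.round(W / S + Z), q_min, q_max).astype(np.int8)
    return q, S, Z

W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)
q, S, Z = linear_quantize(W, n_bits=2)
W_hat = S * (q.astype(np.float32) - Z)           # dequantized approximation
```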
Linear Quantized Arithmetic
  • how can we use integer arithmetic for the following layers?
Linear / Fully-Connected (FC) Layer (without bias) Matrix Multiplication

  • Plug the integer representation into Y = W X (per the above definition): S_Y (q_Y - Z_Y) = S_W S_X (q_W - Z_W)(q_X - Z_X)
  • expanded (the 2nd line of the formula): q_Y = Z_Y + (S_W S_X / S_Y) (q_W q_X - Z_W q_X - Z_X q_W + Z_W Z_X)
    • precompute: the terms Z_X q_W and Z_W Z_X involve only the weights and zero points, so they are known ahead of time
  • Scaling factor S_W S_X / S_Y: empirically it lies in (0, 1), so it can be written as 2^(-n) * M0 with a fixed-point multiplier M0
    • implemented by an integer multiplication followed by a bit shift
  • the weight distribution is usually centered around zero, so it is usually symmetric; we can force Z_W to zero, which removes the Z_W q_X and Z_W Z_X terms (see the sketch after this list)
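A NumPy sketch of this integer pipeline for the no-bias case; `quant_linear_forward` is a hypothetical helper, assuming symmetric weights (Z_W = 0) and already-calibrated quantization parameters, with the multiplier/shift trick shown as a plain float multiply for clarity:

```python
import numpy as np

def quant_linear_forward(q_W, q_X, Z_X, Z_Y, S_W, S_X, S_Y):
    """Integer FC layer (no bias), assuming symmetric weights (Z_W = 0).

    q_W: int8 weights [out, in]; q_X: int8 inputs [in, batch].
    Returns q_Y: int8 outputs [out, batch] with S_Y * (q_Y - Z_Y) ≈ W @ X.
    """
    # Heavy lifting: int8 x int8 matmul accumulated in int32.
    acc = q_W.astype(np.int32) @ q_X.astype(np.int32)
    # Precomputable correction: -Z_X * q_W summed over the input dimension.
    acc -= Z_X * q_W.astype(np.int32).sum(axis=1, keepdims=True)
    # Requantize: scale by S_W*S_X/S_Y (in HW: int multiply + bit shift), add Z_Y.
    scale = (S_W * S_X) / S_Y
    q_Y = np.round(acc * scale) + Z_Y
    return np.clip(q_Y, -128, 127).astype(np.int8)
```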
Linear / Fully-Connected (FC) Layer (with bias) Matrix Multiplication + Addition of bias

  • the bias is quantized to int32 with scale S_W S_X and zero point 0, so it folds directly into the accumulator: q_Y = Z_Y + (S_W S_X / S_Y)(q_W q_X + q_bias - Z_X q_W)
  • the heavy lifting is done in the q_W q_X integer matrix multiplication
  • 32-bit integer addition is used for the bias and the accumulator to prevent overflow of the int8 products (see the sketch below)
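A sketch of the bias handling on top of the previous function; `quantize_bias` is a hypothetical helper, assuming the int32 bias uses scale S_W * S_X and zero point 0 as described above:

```python
import numpy as np

def quantize_bias(b: np.ndarray, S_W: float, S_X: float) -> np.ndarray:
    # int32 bias with scale S_W * S_X and zero point 0 (prevents overflow).
    return np.round(b / (S_W * S_X)).astype(np.int32)

# Inside the integer FC forward pass, the quantized bias is simply added
# to the int32 accumulator before requantization:
#   acc += q_bias[:, None]
```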
Convolution Layer

  • the same derivation applies, with the convolution (in place of the matrix multiplication) computed in integer arithmetic and the same rescaling by S_W S_X / S_Y afterwards
Summary of results from linear quantization

  • doesn't lose a lot of accuracy (float vs 8-bit)
  • but latency is reduced!