EfficientDL - 6. Quantization (Part II)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Quantization

Post-Training Quantization (p.460)

How should we get the optimal linear quantization parameters (S, Z)? (p.460)

Topic I: Quantization Granularity (p.461)

  • per-tensor: hardware friendly, more storage efficient
  • per-vector (higher granularity): higher accuracy
Per-Tensor Quantization (p.464)

  • uses a single scale for the whole weight tensor (p.464)
  • works well for large models (p.464)
  • accuracy drops for small models
  • common failure mode: large differences (more than 100×) in the weight ranges of different output channels, i.e. outlier weights

  • since there are many channels, and different channels can have very different value ranges
    • a single max / scaling factor for the whole tensor is a poor fit, so use per-channel quantization instead
Per-Channel Weight Quantization (p.466)

  • the reconstruction error is smaller for per-channel than for per-tensor (see the sketch below)
  • but the overhead is storing one 32-bit scaling factor per output channel (e.g. 4 instead of 1 in the toy example)
    • LLMs can easily have ~10k channels, i.e. 10k 32-bit scaling factors
  • works very well for medium-sized models
    • but for LLMs, even finer granularity is needed
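
A minimal PyTorch sketch (my own illustration, assuming symmetric linear quantization) of why per-channel scales reconstruct better when one output channel has outlier weights:

```python
import torch

def fake_quant(w, scale, qmax=127):
    # quantize with a symmetric scale, then dequantize ("fake quantization")
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

w = torch.randn(4, 64)
w[0] *= 100.0  # one output channel with a much larger range (outlier weights)

# per-tensor: one scale for the whole tensor
s_tensor = w.abs().max() / 127
# per-channel: one scale per output channel, shape [4, 1] so it broadcasts
s_channel = w.abs().amax(dim=1, keepdim=True) / 127

print("per-tensor MSE :", (w - fake_quant(w, s_tensor)).pow(2).mean().item())
print("per-channel MSE:", (w - fake_quant(w, s_channel)).pow(2).mean().item())
```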
Group Quantization (p.471)

VS-Quant: Per-vector Scaled Quantization (p.472)

  • each tensor still has a coarse-grained scaling factor
  • in addition, each sub-vector (smaller than a whole channel)
    • e.g. every 16/32/64/128 elements
    • gets 1 dedicated, private scaling factor
  • hierarchical scaling factors: the first level is coarse-grained, the second level is fine-grained

  • (p.472) two levels of scale factors:
    • a floating-point coarse-grained scale factor
      • low overhead (shared across the tensor), so it can be a high-precision 32-bit value
    • an integer per-vector scale factor
      • high overhead since there are many of them, so it can be a low-precision integer instead

  • achieves a balance between accuracy and hardware efficiency (see the sketch below):
    • less expensive integer scale factors at finer granularity
    • more expensive floating-point scale factors at coarser granularity
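
A rough sketch of the two-level scaling idea (my own simplification, not the exact VS-Quant algorithm; the group size, bit widths and function name are illustrative):

```python
import torch

def vs_quant(w, group=16, n_bits=4, scale_bits=4):
    # 4-bit weights, a low-precision integer scale per 16-element vector,
    # and one fp32 coarse scale per output channel
    qmax, smax = 2 ** (n_bits - 1) - 1, 2 ** scale_bits - 1
    wg = w.reshape(w.shape[0], -1, group)                    # [out_ch, n_groups, group]
    s_fp = wg.abs().amax(dim=-1, keepdim=True) / qmax        # ideal fp scale per vector
    gamma = s_fp.amax(dim=1, keepdim=True) / smax            # fp32 coarse scale per channel
    s_int = torch.clamp(torch.round(s_fp / gamma), 1, smax)  # integer per-vector scale
    q = torch.clamp(torch.round(wg / (gamma * s_int)), -qmax, qmax)
    return (q * s_int * gamma).reshape(w.shape)              # dequantized weights

w = torch.randn(8, 256)
print((w - vs_quant(w)).abs().mean())
```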

  • Shared Micro-exponent (MX) data type - by Microsoft
    • each element is a tiny float, e.g. S1M2 (1 sign bit, 2 mantissa bits) in MX4, or S1M7 in MX9
    • the exponent is shared hierarchically: an L0 scale (1-bit exponent shared by every 2 elements)
    • and an L1 scale (8-bit exponent shared by every 16 elements)
    • why is it called MX9? S1 + M7 = 8 bits per element, + 1/2 bit from the E1M0 L0 exponent (shared by 2 elements), + 1/2 bit from the E8M0 L1 exponent (shared by 16 elements) = 9 bits per element
  • getting popular for LLMs
  • design questions: how to share, what the hierarchy is, and what precision to use at each level
  • i.e. how to design a good data type with a good accuracy / hardware-efficiency tradeoff, especially for LLMs
  • below 4 bits, group quantization is essentially required
Topic II: Dynamic Range Clipping (p.477)
  • Unlike weights, the activation range varies across inputs (p.478)
  • To determine the floating-point range, activation statistics are gathered before deploying the model (p.478)
  • one way: run a calibration dataset (kept separate from the test dataset) through the trained model and collect the statistics
  • how to get the clipping range from the calibration statistics?
    • use K-L divergence to decide where to clip the range of values
    • this is the method used in NVIDIA TensorRT (see the sketch below)
      • the K-L divergence criterion picks the clipping threshold (the vertical line on the slide)
      • all values to the right of that line are clamped to the threshold, hence the single spike at that point in the clipped distribution
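
A simplified sketch of KL-divergence-based clipping (not NVIDIA's exact TensorRT implementation; the histogram sizes and function name are my own choices):

```python
import numpy as np

def kl_clip_threshold(acts, n_bins=2048, n_levels=128):
    hist, edges = np.histogram(np.abs(acts), bins=n_bins)
    hist = hist.astype(np.float64)
    best_kl, best_t = np.inf, edges[-1]
    for i in range(n_levels, n_bins + 1):
        # reference distribution: clip at bin i, outliers collapse into the last kept bin
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # candidate distribution: quantize the kept bins down to n_levels levels, expand back
        q = np.concatenate([
            np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
            for c in np.array_split(hist[:i], n_levels)
        ])
        p, q = p / p.sum(), q / max(q.sum(), 1e-12)
        kl = np.sum(np.where(p > 0, p * np.log(p / (q + 1e-12)), 0.0))
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t  # clip activations to [-t, t], then compute the scale from t

acts = np.random.randn(100_000)
acts[:100] *= 50.0                 # a few large outliers
print(kl_clip_threshold(acts))     # well below max |activation|
```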
Topic III: Rounding (p.486)
  • Rounding-to-nearest is not optimal (p.487)
    • Weights are correlated with each other. The best rounding for each weight (to nearest) is not the best rounding for the whole tensor (p.487)
  • What is then optimal?
    • Rounding that reconstructs the original activation the best, which may be very different (p.487)
  • AdaRound: add a learnable rounding value between 0 and 1 to each rounded-down weight (i.e. learn whether to round up or down)
  • learned by minimising the error between the pre- and post-rounding outputs (see the sketch below)
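
A condensed sketch of learned rounding in the spirit of AdaRound (the constants, regularizer and function name are my simplifications, not the paper's exact recipe):

```python
import torch

def learn_rounding(w, x, scale, n_bits=4, steps=500, reg=0.01):
    qmax = 2 ** (n_bits - 1) - 1
    w_floor = torch.floor(w / scale)
    v = torch.zeros_like(w, requires_grad=True)       # learnable rounding logits
    opt = torch.optim.Adam([v], lr=1e-2)
    y_ref = x @ w.t()                                 # full-precision layer output
    for _ in range(steps):
        h = torch.clamp(torch.sigmoid(v) * 1.2 - 0.1, 0, 1)      # soft offset in [0, 1]
        w_q = torch.clamp(w_floor + h, -qmax, qmax) * scale
        rec = (x @ w_q.t() - y_ref).pow(2).mean()                # output reconstruction error
        push = (1 - (2 * h - 1).abs().pow(3)).sum()              # push h toward 0 or 1
        loss = rec + reg * push
        opt.zero_grad(); loss.backward(); opt.step()
    h_hard = (v.detach() >= 0).float()                # final hard round-up / round-down choice
    return torch.clamp(w_floor + h_hard, -qmax, qmax) * scale

w, x = torch.randn(64, 128), torch.randn(32, 128)
scale = w.abs().max() / (2 ** 3 - 1)
w_rounded = learn_rounding(w, x, scale)
```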

Quantization-Aware Training (p.493)

How should we improve performance of quantized models? (p.493)

  • for weight quantization

    • important to keep a full-precision copy of the weights during training, so that small SGD gradient updates can accumulate; the accumulated updates are what eventually flip the rounding at the quantization step
    • at runtime, the full-precision weights are no longer needed; just use the quantized weights
  • simulated / fake quantization

    • because it still uses the full-precision weights, inputs and outputs (the quantization is only simulated)
  • How should gradients back-propagate through the (simulated) quantization? (p.498)
    • use the Straight-Through Estimator (STE): treat the quantizer as the identity in the backward pass (see the sketch after this list)

  • standard training flow

    • train a quantized model from scratch (the 2016 standard)
    • nowadays: train the floating-point model to convergence -> quantize with PTQ -> then fine-tune with QAT
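
A minimal sketch of simulated (fake) quantization with a straight-through estimator; the class names and the 8-bit symmetric choice are assumptions for illustration:

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale, qmax):
        # quantize-dequantize in the forward pass
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None, None   # STE: pretend the quantizer is the identity

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        qmax = 127                                        # 8-bit symmetric weights
        scale = self.weight.detach().abs().max() / qmax
        # the full-precision master copy stays in self.weight; only the forward
        # pass sees the quantized values
        w_q = FakeQuantize.apply(self.weight, scale, qmax)
        return torch.nn.functional.linear(x, w_q, self.bias)

layer = QATLinear(128, 64)
loss = layer(torch.randn(32, 128)).pow(2).mean()
loss.backward()   # gradients reach layer.weight through the STE
```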

Binary/Ternary Quantization (p.504)

Can we push the quantization precision to 1 bit? (p.504)

Binary quantizing weights

  • the example on the slides is deterministic binarization
Deterministic Binarization (p.507)
  • computes the bit value based on a threshold, usually 0, resulting in a sign function (p.507)
Stochastic Binarization (p.507)
  • use global statistics or the value of input data to determine the probability of being -1 or +1 (p.507)
  • harder to implement as it requires the hardware to generate random bits when quantizing (p.507); see the sketch below
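A small sketch of the two binarization modes (the hard-sigmoid probability follows the common BinaryConnect-style formulation; treat it as an illustration):

```python
import torch

def binarize_deterministic(w):
    # sign-based: +1 if w >= 0, else -1
    return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

def binarize_stochastic(w):
    # P(+1) = hard sigmoid of w, clipped to [0, 1]; needs one random bit per weight
    p = torch.clamp((w + 1) / 2, 0, 1)
    return torch.where(torch.rand_like(w) < p, torch.ones_like(w), -torch.ones_like(w))

w = torch.randn(5)
print(binarize_deterministic(w), binarize_stochastic(w))
```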
Binary quantizing weights + activation

  • the same multiply-and-sum can be done with XNOR, since the ±1 multiplication table matches XNOR on the 0/1 encoding (the outputs only differ by a scale of 2 and an offset)
  • popcount is an efficient hardware operation that counts the number of ones in a bit array
  • arithmetic to do the dot product of 2 binary vectors: y = (popcount(xnor(a, b)) << 1) - n
      • where << 1 is a left shift by 1 (i.e. multiply by 2) and n is the vector length
    • very hardware efficient, no multiplications (see the sketch below)
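
A plain-Python sketch of the XNOR + popcount dot product (the bit encodings are chosen for illustration):

```python
# -1/+1 values are stored as 0/1 bits; elementwise multiply becomes XNOR, and
# the dot product is (popcount(xnor(a, b)) << 1) - n.
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask           # 1 wherever the two bits agree
    return (bin(xnor).count("1") << 1) - n     # popcount, shift (x2), subtract n

# a = [+1, -1, +1, +1] -> 0b1011, b = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))           # -> 0, same as the +/-1 dot product
```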
Ternary quantizing weights
  • Ternary Weight Networks (TWN) (p.516)
    • Weights are quantized to +1, -1 and 0
    • the threshold is a heuristic using the magic number 0.7 and the expectation of |W|: Δ = 0.7 · E[|W|] (see the sketch below)
  • Trained Ternary Quantization (TTQ) (p.517)
    • stops using the heuristic; the scaling factors are replaced with trainable parameters
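
A small sketch of TWN-style ternarization under the assumptions above (threshold 0.7·E[|W|], scale = mean magnitude of the kept weights):

```python
import torch

def ternarize_twn(w):
    delta = 0.7 * w.abs().mean()          # heuristic threshold
    mask = w.abs() > delta
    alpha = w.abs()[mask].mean()          # scale = mean |w| of the kept weights
    return alpha * (w.sign() * mask)      # values in {-alpha, 0, +alpha}

w = torch.randn(256, 256)
print(ternarize_twn(w).unique())
```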

computation is cheap, memory reference is expensive

  • so quantizing more and more aggressively has diminishing returns
  • 8-bit → 2-bit: 4× less memory, 16× less compute
    • compute cost is roughly quadratic in the number of bits
    • aggressively reducing bits gives only a linear reduction in memory but a quadratic return on compute, and memory is the expensive part
    • 4-bit is the sweet spot! (see the quick arithmetic below)
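
Quick arithmetic behind these numbers (a back-of-the-envelope script, assuming memory scales linearly and multiplier cost roughly quadratically with bit width):

```python
# relative savings versus an 8-bit baseline
for bits in (8, 4, 2):
    print(f"{bits} bits: memory x{8 / bits:g}, compute x{(8 / bits) ** 2:g}")
# 8 bits: memory x1, compute x1
# 4 bits: memory x2, compute x4
# 2 bits: memory x4, compute x16
```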

Mixed-Precision Quantization (p.519)

  • Uniform Quantization: the same bit width for the whole model
  • Mixed-Precision Quantization
    • has a very large design space (a bit-width choice per layer)
    • solution = design automation!
    • but requires significant engineering effort to deploy
      • the compiler and toolchain must support mixed-precision quantization to exploit it

In practice:

  • use coarse-grained precision assignment: e.g. conv layers use one precision, FC layers use another (see the sketch below)
  • balance engineering complexity against performance
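
A tiny sketch of what such a coarse-grained assignment could look like (the function name and bit widths are hypothetical, not tied to any particular toolchain):

```python
import torch

def assign_precision(model, conv_bits=8, fc_bits=4):
    plan = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            plan[name] = conv_bits        # all conv layers share one precision
        elif isinstance(module, torch.nn.Linear):
            plan[name] = fc_bits          # all FC layers share another
    return plan

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(16 * 30 * 30, 10),
)
print(assign_precision(model))   # {'0': 8, '3': 4}
```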