EfficientDL - 6. Quantization (Part II)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Quantization

Post-Training Quantization (p.460)

How should we get the optimal linear quantization parameters (S, Z)? (p.460)

Topic I: Quantization Granularity (p.461)

  • per-tensor: hardware friendly, more storage efficient
  • per-vector (higher granularity): higher accuracy
Per-Tensor Quantization (p.464)

  • uses a single scale for the whole weight tensor (p.464)
  • works well for large models (p.464)
  • accuracy drops for small models
  • common failure mode: large differences (more than 100×) in the weight ranges of different output channels, i.e. outlier weights

  • since there are many channels, and different channels can have very different value ranges
    • a single max / scaling factor for the whole tensor is a poor fit, so use per-channel quantization instead
Per-Channel Weight Quantization (p.466)

  • the reconstruction error is smaller for per-channel than for per-tensor (see the sketch below)
  • but the overhead is storing one 32-bit scaling factor per output channel (e.g. 4 instead of 1 in the toy example)
    • LLMs can easily have ~10k channels, i.e. 10k 32-bit scaling factors
  • works very well for medium-sized models
    • but for LLMs, even finer granularity is needed
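
A minimal PyTorch sketch (my own illustration, assuming symmetric linear quantization) of why per-channel scales reconstruct better when one output channel has outlier weights:

```python
import torch

def fake_quant(w, scale, qmax=127):
    # quantize with a symmetric scale, then dequantize ("fake quantization")
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

w = torch.randn(4, 64)
w[0] *= 100.0  # one output channel with a much larger range (outlier weights)

# per-tensor: one scale for the whole tensor
s_tensor = w.abs().max() / 127
# per-channel: one scale per output channel, shape [4, 1] so it broadcasts
s_channel = w.abs().amax(dim=1, keepdim=True) / 127

print("per-tensor MSE :", (w - fake_quant(w, s_tensor)).pow(2).mean().item())
print("per-channel MSE:", (w - fake_quant(w, s_channel)).pow(2).mean().item())
```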
Group Quantization (p.471)

VS-Quant: Per-vector Scaled Quantization (p.472)

  • each tensor still has a coarse-grained scaling factor
  • in addition, each sub-vector (smaller than a whole channel)
    • e.g. every 16/32/64/128 elements
    • gets 1 dedicated, private scaling factor
  • hierarchical scaling factors: the first level is coarse-grained, the second level is fine-grained

  • (p.472) two levels of scale factors:
    • a floating-point coarse-grained scale factor
      • low overhead (shared across the tensor), so it can be a high-precision 32-bit value
    • an integer per-vector scale factor
      • high overhead since there are many of them, so it can be a low-precision integer instead

  • achieves a balance between accuracy and hardware efficiency (see the sketch below):
    • less expensive integer scale factors at finer granularity
    • more expensive floating-point scale factors at coarser granularity
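
A rough sketch of the two-level scaling idea (my own simplification, not the exact VS-Quant algorithm; the group size, bit widths and function name are illustrative):

```python
import torch

def vs_quant(w, group=16, n_bits=4, scale_bits=4):
    # 4-bit weights, a low-precision integer scale per 16-element vector,
    # and one fp32 coarse scale per output channel
    qmax, smax = 2 ** (n_bits - 1) - 1, 2 ** scale_bits - 1
    wg = w.reshape(w.shape[0], -1, group)                    # [out_ch, n_groups, group]
    s_fp = wg.abs().amax(dim=-1, keepdim=True) / qmax        # ideal fp scale per vector
    gamma = s_fp.amax(dim=1, keepdim=True) / smax            # fp32 coarse scale per channel
    s_int = torch.clamp(torch.round(s_fp / gamma), 1, smax)  # integer per-vector scale
    q = torch.clamp(torch.round(wg / (gamma * s_int)), -qmax, qmax)
    return (q * s_int * gamma).reshape(w.shape)              # dequantized weights

w = torch.randn(8, 256)
print((w - vs_quant(w)).abs().mean())
```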

  • Shared Micro-exponent (MX) data type - by Microsoft
    • each element is a tiny float, e.g. S1M2 (1 sign bit, 2 mantissa bits) in MX4, or S1M7 in MX9
    • the exponent is shared hierarchically: an L0 scale (1-bit exponent shared by every 2 elements)
    • and an L1 scale (8-bit exponent shared by every 16 elements)
    • why is it called MX9? S1 + M7 = 8 bits per element, + 1/2 bit from the E1M0 L0 exponent (shared by 2 elements), + 1/2 bit from the E8M0 L1 exponent (shared by 16 elements) = 9 bits per element
  • getting popular for LLMs
  • design questions: how to share, what the hierarchy is, and what precision to use at each level
  • i.e. how to design a good data type with a good accuracy / hardware-efficiency tradeoff, especially for LLMs
  • below 4 bits, group quantization is essentially required
Topic II: Dynamic Range Clipping (p.477)
  • Unlike weights, the activation range varies across inputs (p.478)
  • To determine the floating-point range, activation statistics are gathered before deploying the model (p.478)
  • one way: run a calibration dataset (kept separate from the test dataset) through the trained model and collect the statistics
  • how to get the clipping range from the calibration statistics?
    • use K-L divergence to decide where to clip the range of values
    • this is the method used in NVIDIA TensorRT (see the sketch below)
      • the K-L divergence criterion picks the clipping threshold (the vertical line on the slide)
      • all values to the right of that line are clamped to the threshold, hence the single spike at that point in the clipped distribution
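
A simplified sketch of KL-divergence-based clipping (not NVIDIA's exact TensorRT implementation; the histogram sizes and function name are my own choices):

```python
import numpy as np

def kl_clip_threshold(acts, n_bins=2048, n_levels=128):
    hist, edges = np.histogram(np.abs(acts), bins=n_bins)
    hist = hist.astype(np.float64)
    best_kl, best_t = np.inf, edges[-1]
    for i in range(n_levels, n_bins + 1):
        # reference distribution: clip at bin i, outliers collapse into the last kept bin
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # candidate distribution: quantize the kept bins down to n_levels levels, expand back
        q = np.concatenate([
            np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
            for c in np.array_split(hist[:i], n_levels)
        ])
        p, q = p / p.sum(), q / max(q.sum(), 1e-12)
        kl = np.sum(np.where(p > 0, p * np.log(p / (q + 1e-12)), 0.0))
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t  # clip activations to [-t, t], then compute the scale from t

acts = np.random.randn(100_000)
acts[:100] *= 50.0                 # a few large outliers
print(kl_clip_threshold(acts))     # well below max |activation|
```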
Topic III: Rounding (p.486)
  • Rounding-to-nearest is not optimal (p.487)
    • Weights are correlated with each other. The best rounding for each weight (to nearest) is not the best rounding for the whole tensor (p.487)
  • What is then optimal?
    • Rounding that reconstructs the original activation the best, which may be very different (p.487)
  • AdaRound: add a learnable rounding value between 0 and 1 to each rounded-down weight (i.e. learn whether to round up or down)
  • learned by minimising the error between the pre- and post-rounding outputs (see the sketch below)
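
A condensed sketch of learned rounding in the spirit of AdaRound (the constants, regularizer and function name are my simplifications, not the paper's exact recipe):

```python
import torch

def learn_rounding(w, x, scale, n_bits=4, steps=500, reg=0.01):
    qmax = 2 ** (n_bits - 1) - 1
    w_floor = torch.floor(w / scale)
    v = torch.zeros_like(w, requires_grad=True)       # learnable rounding logits
    opt = torch.optim.Adam([v], lr=1e-2)
    y_ref = x @ w.t()                                 # full-precision layer output
    for _ in range(steps):
        h = torch.clamp(torch.sigmoid(v) * 1.2 - 0.1, 0, 1)      # soft offset in [0, 1]
        w_q = torch.clamp(w_floor + h, -qmax, qmax) * scale
        rec = (x @ w_q.t() - y_ref).pow(2).mean()                # output reconstruction error
        push = (1 - (2 * h - 1).abs().pow(3)).sum()              # push h toward 0 or 1
        loss = rec + reg * push
        opt.zero_grad(); loss.backward(); opt.step()
    h_hard = (v.detach() >= 0).float()                # final hard round-up / round-down choice
    return torch.clamp(w_floor + h_hard, -qmax, qmax) * scale

w, x = torch.randn(64, 128), torch.randn(32, 128)
scale = w.abs().max() / (2 ** 3 - 1)
w_rounded = learn_rounding(w, x, scale)
```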

Quantization-Aware Training (p.493)

How should we improve performance of quantized models? (p.493)

  • for weight quantization

    • important to keep a full-precision copy of the weights during training, so that small SGD gradient updates can accumulate; the accumulated updates are what eventually flip the rounding at the quantization step
    • at runtime, the full-precision weights are no longer needed; just use the quantized weights
  • simulated / fake quantization

    • because it still uses the full-precision weights, inputs and outputs (the quantization is only simulated)
  • How should gradients back-propagate through the (simulated) quantization? (p.498)
    • use the Straight-Through Estimator (STE): treat the quantizer as the identity in the backward pass (see the sketch after this list)

  • standard training flow

    • train a quantized model from scratch (the 2016 standard)
    • nowadays: train the floating-point model to convergence -> quantize with PTQ -> then fine-tune with QAT
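
A minimal sketch of simulated (fake) quantization with a straight-through estimator; the class names and the 8-bit symmetric choice are assumptions for illustration:

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale, qmax):
        # quantize-dequantize in the forward pass
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None, None   # STE: pretend the quantizer is the identity

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        qmax = 127                                        # 8-bit symmetric weights
        scale = self.weight.detach().abs().max() / qmax
        # the full-precision master copy stays in self.weight; only the forward
        # pass sees the quantized values
        w_q = FakeQuantize.apply(self.weight, scale, qmax)
        return torch.nn.functional.linear(x, w_q, self.bias)

layer = QATLinear(128, 64)
loss = layer(torch.randn(32, 128)).pow(2).mean()
loss.backward()   # gradients reach layer.weight through the STE
```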

Binary/Ternary Quantization (p.504)

Can we push the quantization precision to 1 bit? (p.504)

Binary quantizing weights

  • the example on the slides is deterministic binarization
Deterministic Binarization (p.507)
  • computes the bit value based on a threshold, usually 0, resulting in a sign function (p.507)
Stochastic Binarization (p.507)
  • use global statistics or the value of input data to determine the probability of being -1 or +1 (p.507)
  • harder to implement as it requires the hardware to generate random bits when quantizing (p.507); see the sketch below
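A small sketch of the two binarization modes (the hard-sigmoid probability follows the common BinaryConnect-style formulation; treat it as an illustration):

```python
import torch

def binarize_deterministic(w):
    # sign-based: +1 if w >= 0, else -1
    return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

def binarize_stochastic(w):
    # P(+1) = hard sigmoid of w, clipped to [0, 1]; needs one random bit per weight
    p = torch.clamp((w + 1) / 2, 0, 1)
    return torch.where(torch.rand_like(w) < p, torch.ones_like(w), -torch.ones_like(w))

w = torch.randn(5)
print(binarize_deterministic(w), binarize_stochastic(w))
```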
Binary quantizing weights + activation

  • the same multiply-and-sum can be done with XNOR, since the ±1 multiplication table matches XNOR on the 0/1 encoding (the outputs only differ by a scale of 2 and an offset)
  • popcount is an efficient hardware operation that counts the number of ones in a bit array
  • arithmetic to do the dot product of 2 binary vectors: y = (popcount(xnor(a, b)) << 1) - n
      • where << 1 is a left shift by 1 (i.e. multiply by 2) and n is the vector length
    • very hardware efficient, no multiplications (see the sketch below)
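
A plain-Python sketch of the XNOR + popcount dot product (the bit encodings are chosen for illustration):

```python
# -1/+1 values are stored as 0/1 bits; elementwise multiply becomes XNOR, and
# the dot product is (popcount(xnor(a, b)) << 1) - n.
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask           # 1 wherever the two bits agree
    return (bin(xnor).count("1") << 1) - n     # popcount, shift (x2), subtract n

# a = [+1, -1, +1, +1] -> 0b1011, b = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))           # -> 0, same as the +/-1 dot product
```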
Ternary quantizing weights
  • Ternary Weight Networks (TWN) (p.516)
    • Weights are quantized to +1, -1 and 0
    • the threshold is a heuristic using the magic number 0.7 and the expectation of |W|: Δ = 0.7 · E[|W|] (see the sketch below)
  • Trained Ternary Quantization (TTQ) (p.517)
    • stops using the heuristic; the scaling factors are replaced with trainable parameters
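
A small sketch of TWN-style ternarization under the assumptions above (threshold 0.7·E[|W|], scale = mean magnitude of the kept weights):

```python
import torch

def ternarize_twn(w):
    delta = 0.7 * w.abs().mean()          # heuristic threshold
    mask = w.abs() > delta
    alpha = w.abs()[mask].mean()          # scale = mean |w| of the kept weights
    return alpha * (w.sign() * mask)      # values in {-alpha, 0, +alpha}

w = torch.randn(256, 256)
print(ternarize_twn(w).unique())
```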

computation is cheap, memory reference is expensive

  • so quantizing more and more aggressively has diminishing returns
  • 8-bit → 2-bit: 4× less memory, 16× less compute
    • compute cost is roughly quadratic in the number of bits
    • aggressively reducing bits gives only a linear reduction in memory but a quadratic return on compute, and memory is the expensive part
    • 4-bit is the sweet spot! (see the quick arithmetic below)
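
Quick arithmetic behind these numbers (a back-of-the-envelope script, assuming memory scales linearly and multiplier cost roughly quadratically with bit width):

```python
# relative savings versus an 8-bit baseline
for bits in (8, 4, 2):
    print(f"{bits} bits: memory x{8 / bits:g}, compute x{(8 / bits) ** 2:g}")
# 8 bits: memory x1, compute x1
# 4 bits: memory x2, compute x4
# 2 bits: memory x4, compute x16
```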

Mixed-Precision Quantization (p.519)

  • Uniform Quantization: the same bit width for the whole model
  • Mixed-Precision Quantization
    • has a very large design space (a bit-width choice per layer)
    • solution = design automation!
    • but requires significant engineering effort to deploy
      • the compiler and toolchain must support mixed-precision quantization to exploit it

In practice:

  • use coarse-grained precision assignment: e.g. conv layers use one precision, FC layers use another (see the sketch below)
  • balance engineering complexity against performance
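
A tiny sketch of what such a coarse-grained assignment could look like (the function name and bit widths are hypothetical, not tied to any particular toolchain):

```python
import torch

def assign_precision(model, conv_bits=8, fc_bits=4):
    plan = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            plan[name] = conv_bits        # all conv layers share one precision
        elif isinstance(module, torch.nn.Linear):
            plan[name] = fc_bits          # all FC layers share another
    return plan

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(16 * 30 * 30, 10),
)
print(assign_precision(model))   # {'0': 8, '3': 4}
```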