EfficientDL - 3. Pruning and Sparsity (Part I)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Pruning

MLPerf (the Olympic Games of AI computing) (p.180)

  • Key techniques: pruning, distillation, quantization (p.181)
  • QAT + pruning + distillation gave the biggest speedups in the MLPerf results

Neural network pruning can:

  • reduce the parameter count of neural networks by more than 90%,
  • decrease storage requirements,
  • improve the computational efficiency of neural networks (p.182)

Per the earlier energy-consumption slide, we want to reduce memory references as much as possible: pruning weights and activations reduces memory references, which reduces DRAM accesses, which saves battery.

  • normally we start with a trained model, then prune and fine-tune iteratively
  • starting the pruning from an untrained model is challenging

Pruning Ratio

  • if we start directly with a very high pruning ratio, it is difficult for the remaining parameters to recover the accuracy
    • e.g. rather than jumping straight from 50 % to 30 %, it is better to go 50 % → 40 % → 30 % and prune gradually, fine-tuning in between (see the sketch below)
  • the larger the model to begin with, the larger the pruning ratio it can potentially tolerate
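
A minimal sketch of this iterative prune-and-fine-tune loop with a gradually tightening ratio, using simple magnitude pruning. The toy model, the 0.5/0.4/0.3 schedule, and the empty finetune() stub are placeholders for a real trained model and training loop.

```python
import torch
import torch.nn as nn

# Toy stand-in: a real workflow would start from a trained model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

def magnitude_mask(weight: torch.Tensor, density: float) -> torch.Tensor:
    """Keep the `density` fraction of weights with the largest magnitude."""
    k = max(1, int(weight.numel() * density))              # number of weights to keep
    threshold = weight.abs().flatten().topk(k).values[-1]  # k-th largest magnitude
    return (weight.abs() >= threshold).float()

def finetune(model):
    """Placeholder for a fine-tuning loop that recovers accuracy after each step."""
    pass

# Gradually lower the kept-weight fraction (50% -> 40% -> 30%) instead of
# jumping straight to the final target, fine-tuning between the steps.
for density in [0.5, 0.4, 0.3]:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            mask = magnitude_mask(module.weight.data, density)
            module.weight.data *= mask
    finetune(model)
```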

Pruning Formulation

p.198
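
The formulation on p.198 is the usual constrained-optimization view: find pruned weights $W_p$ that minimize the loss subject to a budget of $N$ non-zero parameters (notation assumed here, following the standard form):

$$
\arg\min_{W_p} L(\mathbf{x}; W_p)
\quad \text{subject to} \quad
\lVert W_p \rVert_0 \le N
$$

where $\lVert \cdot \rVert_0$ counts the non-zero entries.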

Pruning Granularity

  • different granularities, from structured to non-structured (p.204)

Structured Pruning vs Unstructured Pruning

  • structured pruning: better hardware efficiency, but can prune less

  • unstructured pruning: worse hardware efficiency, but the most flexible pruning choice (p.207)

  • weights of convolutional layers have 4 dimensions (p.208)

  • 4 dimensions give us more choices to select pruning granularities (p.208)

  • fine-grained: usually a larger compression ratio, since we can flexibly find “redundant” weights (p.213); see the sketch below
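
To make these granularities concrete for a 4-D conv weight (c_out, c_in, k_h, k_w), here is a small sketch contrasting a fine-grained (per-element) mask with output-channel pruning. The shapes, the 60th-percentile threshold, and the "drop 2 channels" choice are illustrative only.

```python
import torch

# Conv weight with 4 dimensions: (output channels, input channels, kernel h, kernel w)
W = torch.randn(8, 4, 3, 3)

# Fine-grained (unstructured): mask individual elements by magnitude.
# Most flexible / largest compression, but irregular and hard to accelerate.
threshold = W.abs().flatten().quantile(0.6)
fine_mask = (W.abs() > threshold).float()        # keeps roughly the top-40% elements

# Channel-level (structured): score whole output channels (e.g. by L2 norm)
# and drop the weakest ones; the surviving tensor stays dense, just smaller.
channel_scores = W.flatten(1).norm(dim=1)        # one score per output channel
keep = channel_scores.argsort(descending=True)[:6]
W_channel_pruned = W[keep]                       # shape becomes (6, 4, 3, 3)

print(fine_mask.mean().item(), W_channel_pruned.shape)
```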

Pattern-based pruning (as per NVIDIA A100)

  • Pattern-based Pruning: N:M sparsity (p.218)
  • N:M sparsity means that in each contiguous group of M elements, N of them are pruned (p.218)
  • 2:4 sparsity = 50% sparsity
  • out of every 4 elements, only 2 are non-zero
    • benefit: can condense every 4 elements into 2 stored values (e.g. 8 positions condense into 4 values)
    • cost: need to store the index of where each kept element sits within its group
      • since there are 4 possible positions in each group of 4, each kept element needs a 2-bit index, so 8 positions need 4 × 2-bit index values
  • NVIDIA's Ampere GPU architecture delivers up to 2x speedup (p.218); see the sketch below
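
A minimal sketch of enforcing 2:4 sparsity by keeping the two largest-magnitude values in every contiguous group of four. This only shows the masking step, not NVIDIA's compressed storage layout or the 2-bit index metadata.

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude values in every contiguous group of 4
    along the last dimension (2:4 sparsity = 50% sparsity)."""
    assert weight.shape[-1] % 4 == 0
    groups = weight.reshape(-1, 4)                  # view the weights as groups of 4
    topk = groups.abs().topk(2, dim=1).indices      # positions of the 2 largest |values|
    mask = torch.zeros_like(groups)
    mask.scatter_(1, topk, 1.0)                     # keep exactly 2 per group
    return (groups * mask).reshape(weight.shape)

W = torch.randn(4, 8)
W_sparse = prune_2_to_4(W)
# Each contiguous group of 4 now holds exactly 2 non-zeros; the hardware format
# additionally stores a 2-bit index per kept value to locate it within its group.
print((W_sparse != 0).float().mean().item())        # -> 0.5
```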

Channel Pruning

  • easiest to accelerate, even without any special accelerator (e.g. on a phone or Raspberry Pi)
  • but the tradeoff is that it can only prune less (e.g. ~30%)
  • this is widely used in industry
    • very easy to accelerate
  • Pro: Direct speed up due to reduced channel numbers (p.223)
  • Con: smaller compression ratio (p.223); see the sketch below
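
A small sketch of why channel pruning gives a direct speedup: the surviving filters form a genuinely smaller dense layer, so no sparse kernels are needed. The L2-norm criterion and the 70% keep ratio are illustrative assumptions; in a full network the next layer's input channels must be sliced to match.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

# Score each output channel (filter) by its L2 norm and keep the strongest 70%.
scores = conv.weight.detach().flatten(1).norm(dim=1)
n_keep = int(0.7 * conv.out_channels)
keep = scores.argsort(descending=True)[:n_keep].sort().values

# Build a genuinely smaller layer: no masks and no sparse kernels required.
pruned = nn.Conv2d(16, n_keep, kernel_size=3, padding=1)
pruned.weight.data = conv.weight.data[keep].clone()
pruned.bias.data = conv.bias.data[keep].clone()

x = torch.randn(1, 16, 56, 56)
print(conv(x).shape, pruned(x).shape)   # (1, 32, 56, 56) vs (1, 22, 56, 56)
```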

Pruning Criterion

  • What synapses and neurons should we prune? (p.225)
  • The less important the parameters being removed are, the better the performance of the pruned neural network is. (p.226)
Magnitude-based Pruning (p.227)
  • considers weights with larger absolute values to be more important than other weights (p.227)
  • simplest method that still works quite well! just prune the small-valued weights below some threshold (see the sketch below)
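
A tiny sketch of the magnitude criterion with an explicit threshold; the weight values and the 0.1 threshold are made up for illustration.

```python
import torch

W = torch.tensor([[ 0.8, -0.05,  1.2, 0.02],
                  [-0.4,  0.01, -0.9, 0.30]])

# Magnitude criterion: importance = |w|; prune everything below the threshold.
threshold = 0.1
mask = (W.abs() > threshold).float()
print(mask)        # the -0.05, 0.02 and 0.01 entries are pruned
print(W * mask)
```
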
Scaling-based Pruning (p.231)
  • learn a scaling factor per channel (the scaling factors are trainable parameters - p.232)
  • during training, encourage the scaling factors to be as close to zero as possible using something like L1 / L2 regularisation
  • when pruning, look at the absolute value of each scaling factor: if it is small, prune that channel (see the sketch below)
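
A sketch of the scaling-factor idea in the network-slimming style, where the BatchNorm per-channel scale plays the role of the trainable scaling factor and an L1 penalty pushes it toward zero. The penalty weight and the 1e-2 pruning threshold are arbitrary choices here.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

def l1_on_scaling_factors(model, lam=1e-4):
    """Sparsity-inducing penalty on the per-channel scaling factors (BN gammas)."""
    return lam * sum(m.weight.abs().sum() for m in model.modules()
                     if isinstance(m, nn.BatchNorm2d))

# During training:  loss = task_loss + l1_on_scaling_factors(model)
# After training, channels whose |gamma| stays small are pruned:
bn = model[1]
prune_mask = bn.weight.detach().abs() < 1e-2   # scaling factors near zero
                                               # (none yet: this model is untrained)
print(prune_mask.sum().item(), "channels would be pruned")
```
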
Second-Order-based Pruning (p.234)
  • as per Optimal Brain Damage LeCun et al., NeurIPS 1989 (p.239)
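
The second-order criterion comes from a Taylor expansion of the loss around the trained weights; assuming the model has converged (first-order term approximately zero) and ignoring cross terms, OBD scores each weight by

$$
\delta L \approx \sum_i g_i\,\delta w_i + \tfrac{1}{2}\sum_i h_{ii}\,\delta w_i^2 + \dots
\quad\Rightarrow\quad
\text{importance}(w_i) \approx \tfrac{1}{2}\, h_{ii}\, w_i^2
$$

where $g_i = \partial L / \partial w_i \approx 0$ at convergence and $h_{ii}$ is the diagonal entry of the Hessian.
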
Percentage-of-Zero-Based Pruning (p.241)
  • ReLU activation will generate zeros in the output activation (p.242)
  • Average Percentage of Zero activations (APoZ) can be exploited to measure the importance of the neurons (p.242)
  • for channel 0:
    • in the first image in the batch: 5 instances where the activation is 0
    • in the second image: 6 instances where the activation is 0
    • divide by the total number of elements (4x4 for the first image + 4x4 for the second image = 32)
    • hence APoZ = (5 + 6) / 32 = 11/32
  • then do the same for the other channels
  • then prune the channel with the largest APoZ, in this case channel 2! (see the sketch below)
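
A sketch of computing APoZ per channel over a batch of post-ReLU activations, mirroring the 2-image, 4x4 example above (the activations here are random, so the numbers will not be the 11/32 from the example); the channel with the largest APoZ is the pruning candidate.

```python
import torch

# Post-ReLU activations: (batch, channels, height, width), e.g. 2 images, 3 channels, 4x4
acts = torch.relu(torch.randn(2, 3, 4, 4))

# APoZ per channel = (# zero activations) / (# activations), averaged over the batch.
zeros = (acts == 0).float()
apoz = zeros.mean(dim=(0, 2, 3))          # one value per channel

prune_channel = apoz.argmax().item()      # largest fraction of zeros = least useful
print(apoz, "-> prune channel", prune_channel)
```
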
Regression-based Pruning (p.244)
  • Minimize reconstruction error of the corresponding layer’s outputs (p.244)
  • Channel Pruning for Accelerating Very Deep Neural Networks, He et al., ICCV 2017 (p.249)
  • previously we talked about the loss from input to output, i.e. the final loss after the entire NN
  • here we examine one individual layer at a time
  • useful for LLMs, since there is no need for end-to-end fine-tuning: the weights are reconstructed per layer rather than via backprop through the whole network
  • so we try to solve just one layer at a time instead
  • prune along the input channels, not the batch dimension
    • there is an opportunity to prune in the batch dimension as well; see a later chapter
  • how do we determine which channels to prune? (the red columns in the figure); one common formulation is sketched below
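
In the style of He et al. (ICCV 2017), the layer-wise reconstruction problem can be written roughly as follows, where Y is the original layer output, X_i the input feature slice for input channel i, W_i the matching filter slice, beta a per-channel selection vector, and c' the number of channels to keep:

$$
\arg\min_{\beta,\,W}\;\Big\lVert\, Y - \sum_{i=1}^{c_{\text{in}}} \beta_i\, X_i W_i^{\top} \Big\rVert_F^2
\quad \text{subject to} \quad \lVert \beta \rVert_0 \le c'
$$

The paper solves this alternately: a LASSO-style selection step for beta, then least squares to re-fit the remaining weights.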