EfficientDL - 3. Pruning and Sparsity (Part I)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
MLPerf (the Olympics of AI computing) (p.180)
- Key techniques: pruning, distillation, quantization (p.181)
- QAT + Pruning + Distillation ⇒ biggest speedup for the MLPerf results
Neural network pruning can:
- reduce the parameter count of neural networks by more than 90%,
- decrease storage requirements,
- improve the computational efficiency of neural networks (p.182)
per the earlier energy-consumption slide, we want to reduce memory references as much as possible ⇒ pruning weights and activations reduces memory references ⇒ fewer DRAM accesses ⇒ lower energy use / longer battery life
- normally we start from a trained model, then prune and fine-tune iteratively
- pruning starting from an untrained model is much more challenging
Pruning Ratio
- if you start directly with a very high pruning ratio, the remaining parameters will struggle to recover the accuracy
- rather than pruning from 50% straight to 30%, it is better to prune gradually, e.g. 50% → 40% → 30%, fine-tuning in between
- the larger the model to begin with, the larger the pruning ratio it can typically tolerate
Pruning Formulation
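These notes leave this heading empty; as a reminder, here is a sketch of the usual way pruning is written as constrained optimization (the notation $W_p$ and $N$ is assumed here, not taken from the slides):

$$
\arg\min_{W_p} \; L(\mathbf{x};\, W_p) \quad \text{subject to} \quad \lVert W_p \rVert_0 \le N
$$

where $L$ is the training loss, $W_p$ are the weights after pruning, $\lVert W_p \rVert_0$ counts the non-zero weights, and $N$ is the target number of weights to keep.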
Pruning Granularity
- different granularities, from structured to unstructured (p.204)
Structured Pruning vs Unstructured Pruning
- structured: better hardware efficiency, but can prune less (compare the toy masks sketched below)
- unstructured: worse hardware efficiency, but the most flexible pruning choice
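A toy numpy illustration (my own sketch, not from the slides) of the tradeoff above: an unstructured mask reaches the same sparsity but leaves an irregular matrix, while structured (row/channel) pruning yields a genuinely smaller dense matrix that any hardware can run fast.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))

# Unstructured: zero out roughly the 50% smallest-magnitude entries, wherever they are.
threshold = np.median(np.abs(W))
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)
print((W_unstructured == 0).mean())        # ~0.5 sparsity, but the layout is irregular

# Structured (row-wise, like channel pruning): drop the 2 rows with the smallest L2 norm.
row_norms = np.linalg.norm(W, axis=1)      # one importance score per row
keep = np.sort(np.argsort(row_norms)[2:])  # indices of the 2 strongest rows
W_structured = W[keep]                     # physically smaller (2, 8) dense matrix
print(W_structured.shape)
```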
- weights of convolutional layers have 4 dimensions (p.207-208)
- the 4 dimensions give us more choices when selecting pruning granularities (p.208)
- fine-grained ⇒ usually larger compression ratio since we can flexibly find “redundant” weights (p.213)
Pattern-based Pruning (as in NVIDIA A100)
- Pattern-based Pruning: N:M sparsity (p.218)
- N:M sparsity means that in each contiguous group of M elements, N of them are pruned (p.218)
- 2:4 sparsity ⇒ 50% sparsity
- out of every 4, only 2 of them are non zero
- benefit ⇒ the non-zero values can be condensed into half as many stored elements (e.g. 8 values → 4 non-zeros)
- cost ⇒ need to store the index of where each non-zero element sits within its group of 4
- since there are 4 possible positions within a group, each index takes 2 bits: a group of 4 positions needs 2 × 2-bit indices, so 8 positions need 4 × 2-bit indices (see the sketch after this list)
- NVIDIA’s Ampere GPU architecture delivers up to a 2x speedup (p.218)
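A minimal numpy sketch of enforcing 2:4 sparsity, assuming a simple keep-the-two-largest-magnitudes rule per group of 4; it mimics the idea of the format (compressed values plus 2-bit position metadata) but is not NVIDIA's actual hardware layout.

```python
import numpy as np

def prune_2_to_4(w):
    """Keep the 2 largest-magnitude values in every contiguous group of 4."""
    groups = w.reshape(-1, 4)
    # positions (0..3) of the two largest |values| per group -> each fits in 2 bits
    idx = np.sort(np.argsort(np.abs(groups), axis=1)[:, 2:], axis=1)
    values = np.take_along_axis(groups, idx, axis=1)   # compressed: half the elements
    return values, idx                                  # idx is the 2-bit-per-entry metadata

w = np.array([0.1, -2.0, 0.0, 3.0,   0.5, 0.4, -0.3, 0.2])
values, idx = prune_2_to_4(w)
print(values)   # [[-2.   3. ] [ 0.5  0.4]]
print(idx)      # [[1 3] [0 1]]  -> 4 indices of 2 bits each for these 8 positions
```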
Channel Pruning
- easiest to accelerate without any specialized accelerator ⇒ e.g. on a phone or Raspberry Pi
- the tradeoff is that you can prune less (e.g. ~30%)
- widely used in industry because it is so easy to accelerate (a small sketch follows this list)
- Pro: Direct speed up due to reduced channel numbers (p.223)
- Con: smaller compression ratio (p.223)
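A hedged PyTorch sketch of why channel pruning gives a direct speedup: removing output channels of one conv (and the matching input channels of the next) produces physically smaller dense layers, with no sparse kernels required. Ranking channels by filter L2 norm and keeping 44 of 64 channels are illustrative choices, not the course's prescription.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Rank conv1's output channels by the L2 norm of their filters (illustrative criterion).
norms = conv1.weight.detach().flatten(1).norm(dim=1)          # one score per output channel
keep = norms.argsort(descending=True)[:44].sort().values      # keep ~70% of the channels

# Build physically smaller layers: no masks, no sparse indexing needed.
new_conv1 = nn.Conv2d(3, len(keep), kernel_size=3, padding=1)
new_conv1.weight.data = conv1.weight.data[keep].clone()
new_conv1.bias.data = conv1.bias.data[keep].clone()

new_conv2 = nn.Conv2d(len(keep), 128, kernel_size=3, padding=1)
new_conv2.weight.data = conv2.weight.data[:, keep].clone()    # drop matching input channels
new_conv2.bias.data = conv2.bias.data.clone()

x = torch.randn(1, 3, 32, 32)
print(new_conv2(new_conv1(x)).shape)   # regular dense convs, directly faster on any hardware
```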
Pruning Criterion
- What synapses and neurons should we prune? (p.225)
- The less important the parameters being removed are, the better the performance of pruned neural network is. (p.226)
Magnitude-based Pruning (p.227)
- considers weights with larger absolute values to be more important than other weights (p.227)
- the simplest method, and it still works quite well! just prune the small-magnitude weights below some threshold (see the sketch below)
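A minimal PyTorch sketch of magnitude-based pruning, assuming we pick the threshold so that a target fraction of weights is removed; the 90% sparsity and the layer size are just example numbers.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps the largest-|w| entries; `sparsity` = fraction pruned."""
    k = int(weight.numel() * sparsity)                       # number of weights to remove
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values    # k-th smallest |w|
    return (weight.abs() > threshold).float()

w = torch.randn(256, 256)
mask = magnitude_prune(w, sparsity=0.9)    # prune 90% of the weights
w_pruned = w * mask
print(1.0 - mask.mean().item())            # ~0.9 achieved sparsity
```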
Scaling-based Pruning (p.231)
- learn a scaling factor per channel (the scaling factors are trainable parameters, p.232)
- during training, encourage the scaling factors to be as close to zero as possible using L1 (or L2) regularisation
- when pruning, look at the absolute value of each scaling factor: if it is small, prune that channel (sketched below)
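A hedged PyTorch sketch of scaling-based channel selection in the spirit of the above: penalize the BatchNorm scale factors (γ) with L1 during training, then prune channels whose |γ| is small. The penalty weight `lam`, the threshold `0.05`, and the stand-in task loss are assumed values for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())
bn = model[1]

def l1_on_scales(model, lam=1e-4):
    """L1 penalty on all BatchNorm scaling factors (gamma), pushing them toward zero."""
    return lam * sum(m.weight.abs().sum() for m in model.modules()
                     if isinstance(m, nn.BatchNorm2d))

# One illustrative training step: task loss (stand-in) + sparsity penalty on gamma.
x = torch.randn(8, 3, 16, 16)
loss = model(x).pow(2).mean() + l1_on_scales(model)
loss.backward()

# After training, channels with small |gamma| are considered unimportant and pruned.
# (In this untrained sketch gamma is still at its init value of 1 for every channel.)
threshold = 0.05
keep = (bn.weight.detach().abs() > threshold).nonzero().flatten()
print(f"keeping {len(keep)} / {bn.num_features} channels")
```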
Second-Order-based Pruning (p.234)
- as per Optimal Brain Damage, LeCun et al., NeurIPS 1989 (p.239); the saliency it uses is sketched below
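For reference, the saliency from the Optimal Brain Damage paper: expand the loss to second order around the trained weights, assume the model sits at a local minimum (so the first-order term vanishes) and that the Hessian is diagonal, giving

$$
\delta L_i \;\approx\; \frac{1}{2}\, h_{ii}\, w_i^{2}, \qquad h_{ii} = \frac{\partial^{2} L}{\partial w_i^{2}}
$$

so the weights with the smallest $\tfrac{1}{2} h_{ii} w_i^{2}$ are pruned first.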
Percentage-of-Zero-Based Pruning (p.241)
- ReLU activation will generate zeros in the output activation (p.242)
- the Average Percentage of Zero activations (APoZ) can be exploited to measure the importance of the neurons (p.242)
- for channel 0
- first image in the batch: 5 instances where the activation is 0
- second image in the batch: 6 instances where the activation is 0
- divide by the total number of activation elements (4×4 per image, 2 images in the batch = 32)
- hence APoZ(channel 0) = (5 + 6) / 32 = 11/32
- then do the same for the other channels
- then prune the channel with the largest APoZ ⇒ in this case channel 2! (a small sketch follows)
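A small numpy sketch of the APoZ computation walked through above; the activations here are random stand-ins, and the 5 + 6 zeros over 32 elements from the worked example appear only in the comment.

```python
import numpy as np

def apoz(activations):
    """Average Percentage of Zeros per channel.
    activations: post-ReLU tensor of shape (batch, channels, H, W)."""
    zeros = (activations == 0)
    return zeros.mean(axis=(0, 2, 3))     # average over batch and spatial dims

# In the worked example above: channel 0 has 5 + 6 zeros over two 4x4 maps,
# so APoZ(channel 0) = 11 / 32.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(2, 3, 4, 4)), 0.0)   # 2 images, 3 channels, 4x4, ReLU'd
scores = apoz(acts)
to_prune = scores.argmax()     # prune the channel with the LARGEST APoZ (most zeros)
print(scores, to_prune)
```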
Regression-based Pruning (p.244)
- Minimize reconstruction error of the corresponding layer’s outputs (p.244)
- Channel Pruning for Accelerating Very Deep Neural Networks, He et al., ICCV 2017 (p.249)
- previous criteria consider the final loss of the entire network, from input to output
- here we examine one individual layer in isolation
- useful for LLMs, since we do not need full end-to-end training (forward + backward over the whole network) to reconstruct the weights
- so we work on one layer at a time instead
- prune along the input channel dimension, not the batch dimension
- there is an opportunity to prune in the batch dimension as well; see a later chapter
- how do we determine which channels to prune? (the red columns in the figure; see the reconstruction objective below)
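For reference, the layer-wise reconstruction objective from He et al. (ICCV 2017), with $Y$ the original layer output, $X_i$ the input slice contributed by channel $i$, $W_i$ the matching filter slice, $\beta_i$ a channel-selection coefficient (relaxed with a LASSO penalty in the paper), and $c'$ the number of channels kept:

$$
\arg\min_{\beta,\,W}\; \frac{1}{2N}\left\lVert Y - \sum_{i=1}^{c} \beta_i\, X_i W_i^{\top} \right\rVert_F^{2}
\quad \text{s.t.} \quad \lVert \beta \rVert_0 \le c'
$$

Channels with $\beta_i = 0$ (the red columns) are pruned, and the remaining weights $W$ are re-fit by least squares to minimize the reconstruction error.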