EfficientDL - 3. Pruning and Sparsity (Part I)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
MLPerf (the Olympics of AI computing) (p.180)
- Key techniques: pruning, distillation, quantization (p.181)
- QAT + Pruning + Distillation ⇒ biggest speedup for the MLPerf results
Neural network pruning can:
- reduce the parameter count of neural networks by more than 90%,
- decrease storage requirements,
- improve the computational efficiency of neural networks (p.182)
per the earlier energy-consumption slide, we want to reduce memory references as much as possible ⇒ pruning weights and activations reduces memory references ⇒ fewer DRAM accesses ⇒ lower energy use / longer battery life
- normally we start from a trained model, then prune and fine-tune iteratively
- pruning starting from an untrained model is much more challenging
Pruning Ratio
- if you start directly with a very high pruning ratio, the remaining parameters will struggle to recover the accuracy
- rather than pruning from 50% straight to 30%, it is better to prune gradually, e.g. 50% → 40% → 30%, fine-tuning in between
- the larger the model to begin with, the larger the pruning ratio it can typically tolerate
Pruning Formulation
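These notes leave this heading empty; as a reminder, here is a sketch of the usual way pruning is written as constrained optimization (the notation $W_p$ and $N$ is assumed here, not taken from the slides):

$$
\arg\min_{W_p} \; L(\mathbf{x};\, W_p) \quad \text{subject to} \quad \lVert W_p \rVert_0 \le N
$$

where $L$ is the training loss, $W_p$ are the weights after pruning, $\lVert W_p \rVert_0$ counts the non-zero weights, and $N$ is the target number of weights to keep.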
Pruning Granularity
- different granularities, from structured to unstructured (p.204)
Structured Pruning vs Unstructured Pruning
- structured: better hardware efficiency, but can prune less (compare the toy masks sketched below)
- unstructured: worse hardware efficiency, but the most flexible pruning choice
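A toy numpy illustration (my own sketch, not from the slides) of the tradeoff above: an unstructured mask reaches the same sparsity but leaves an irregular matrix, while structured (row/channel) pruning yields a genuinely smaller dense matrix that any hardware can run fast.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))

# Unstructured: zero out roughly the 50% smallest-magnitude entries, wherever they are.
threshold = np.median(np.abs(W))
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)
print((W_unstructured == 0).mean())        # ~0.5 sparsity, but the layout is irregular

# Structured (row-wise, like channel pruning): drop the 2 rows with the smallest L2 norm.
row_norms = np.linalg.norm(W, axis=1)      # one importance score per row
keep = np.sort(np.argsort(row_norms)[2:])  # indices of the 2 strongest rows
W_structured = W[keep]                     # physically smaller (2, 8) dense matrix
print(W_structured.shape)
```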
- weights of convolutional layers have 4 dimensions (p.207-208)
- the 4 dimensions give us more choices when selecting pruning granularities (p.208)
- fine-grained ⇒ usually larger compression ratio since we can flexibly find “redundant” weights (p.213)
Pattern-based Pruning (as in NVIDIA A100)
- Pattern-based Pruning: N:M sparsity (p.218)
- N:M sparsity means that in each contiguous group of M elements, N of them are pruned (p.218)
- 2:4 sparsity ⇒ 50% sparsity
- out of every 4, only 2 of them are non zero
- benefit ⇒ the non-zero values can be condensed into half as many stored elements (e.g. 8 values → 4 non-zeros)
- cost ⇒ need to store the index of where each non-zero element sits within its group of 4
- since there are 4 possible positions within a group, each index takes 2 bits: a group of 4 positions needs 2 × 2-bit indices, so 8 positions need 4 × 2-bit indices (see the sketch after this list)
- NVIDIA’s Ampere GPU architecture delivers up to a 2x speedup (p.218)
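A minimal numpy sketch of enforcing 2:4 sparsity, assuming a simple keep-the-two-largest-magnitudes rule per group of 4; it mimics the idea of the format (compressed values plus 2-bit position metadata) but is not NVIDIA's actual hardware layout.

```python
import numpy as np

def prune_2_to_4(w):
    """Keep the 2 largest-magnitude values in every contiguous group of 4."""
    groups = w.reshape(-1, 4)
    # positions (0..3) of the two largest |values| per group -> each fits in 2 bits
    idx = np.sort(np.argsort(np.abs(groups), axis=1)[:, 2:], axis=1)
    values = np.take_along_axis(groups, idx, axis=1)   # compressed: half the elements
    return values, idx                                  # idx is the 2-bit-per-entry metadata

w = np.array([0.1, -2.0, 0.0, 3.0,   0.5, 0.4, -0.3, 0.2])
values, idx = prune_2_to_4(w)
print(values)   # [[-2.   3. ] [ 0.5  0.4]]
print(idx)      # [[1 3] [0 1]]  -> 4 indices of 2 bits each for these 8 positions
```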
Channel Pruning
- easiest to accelerate without any specialized accelerator ⇒ e.g. on a phone or Raspberry Pi
- the tradeoff is that you can prune less (e.g. ~30%)
- widely used in industry because it is so easy to accelerate (a small sketch follows this list)
- Pro: Direct speed up due to reduced channel numbers (p.223)
- Con: smaller compression ratio (p.223)
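A hedged PyTorch sketch of why channel pruning gives a direct speedup: removing output channels of one conv (and the matching input channels of the next) produces physically smaller dense layers, with no sparse kernels required. Ranking channels by filter L2 norm and keeping 44 of 64 channels are illustrative choices, not the course's prescription.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Rank conv1's output channels by the L2 norm of their filters (illustrative criterion).
norms = conv1.weight.detach().flatten(1).norm(dim=1)          # one score per output channel
keep = norms.argsort(descending=True)[:44].sort().values      # keep ~70% of the channels

# Build physically smaller layers: no masks, no sparse indexing needed.
new_conv1 = nn.Conv2d(3, len(keep), kernel_size=3, padding=1)
new_conv1.weight.data = conv1.weight.data[keep].clone()
new_conv1.bias.data = conv1.bias.data[keep].clone()

new_conv2 = nn.Conv2d(len(keep), 128, kernel_size=3, padding=1)
new_conv2.weight.data = conv2.weight.data[:, keep].clone()    # drop matching input channels
new_conv2.bias.data = conv2.bias.data.clone()

x = torch.randn(1, 3, 32, 32)
print(new_conv2(new_conv1(x)).shape)   # regular dense convs, directly faster on any hardware
```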
Pruning Criterion
- What synapses and neurons should we prune? (p.225)
- The less important the parameters being removed are, the better the performance of pruned neural network is. (p.226)
Magnitude-based Pruning (p.227)
- considers weights with larger absolute values to be more important than other weights (p.227)
- the simplest method, and it still works quite well! just prune the small-magnitude weights below some threshold (see the sketch below)
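A minimal PyTorch sketch of magnitude-based pruning, assuming we pick the threshold so that a target fraction of weights is removed; the 90% sparsity and the layer size are just example numbers.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps the largest-|w| entries; `sparsity` = fraction pruned."""
    k = int(weight.numel() * sparsity)                       # number of weights to remove
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values    # k-th smallest |w|
    return (weight.abs() > threshold).float()

w = torch.randn(256, 256)
mask = magnitude_prune(w, sparsity=0.9)    # prune 90% of the weights
w_pruned = w * mask
print(1.0 - mask.mean().item())            # ~0.9 achieved sparsity
```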
Scaling-based Pruning (p.231)
- learn a scaling factor per channel (the scaling factors are trainable parameters, p.232)
- during training, encourage the scaling factors to be as close to zero as possible using L1 (or L2) regularisation
- when pruning, look at the absolute value of each scaling factor: if it is small, prune that channel (sketched below)
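A hedged PyTorch sketch of scaling-based channel selection in the spirit of the above: penalize the BatchNorm scale factors (γ) with L1 during training, then prune channels whose |γ| is small. The penalty weight `lam`, the threshold `0.05`, and the stand-in task loss are assumed values for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())
bn = model[1]

def l1_on_scales(model, lam=1e-4):
    """L1 penalty on all BatchNorm scaling factors (gamma), pushing them toward zero."""
    return lam * sum(m.weight.abs().sum() for m in model.modules()
                     if isinstance(m, nn.BatchNorm2d))

# One illustrative training step: task loss (stand-in) + sparsity penalty on gamma.
x = torch.randn(8, 3, 16, 16)
loss = model(x).pow(2).mean() + l1_on_scales(model)
loss.backward()

# After training, channels with small |gamma| are considered unimportant and pruned.
# (In this untrained sketch gamma is still at its init value of 1 for every channel.)
threshold = 0.05
keep = (bn.weight.detach().abs() > threshold).nonzero().flatten()
print(f"keeping {len(keep)} / {bn.num_features} channels")
```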
Second-Order-based Pruning (p.234)
- as per Optimal Brain Damage, LeCun et al., NeurIPS 1989 (p.239); the saliency it uses is sketched below
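For reference, the saliency from the Optimal Brain Damage paper: expand the loss to second order around the trained weights, assume the model sits at a local minimum (so the first-order term vanishes) and that the Hessian is diagonal, giving

$$
\delta L_i \;\approx\; \frac{1}{2}\, h_{ii}\, w_i^{2}, \qquad h_{ii} = \frac{\partial^{2} L}{\partial w_i^{2}}
$$

so the weights with the smallest $\tfrac{1}{2} h_{ii} w_i^{2}$ are pruned first.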
Percentage-of-Zero-Based Pruning (p.241)
- ReLU activation will generate zeros in the output activation (p.242)
- the Average Percentage of Zero activations (APoZ) can be exploited to measure the importance of the neurons (p.242)
- for channel 0
- first image in the batch: 5 instances where the activation is 0
- second image in the batch: 6 instances where the activation is 0
- divide by the total number of activation elements (4×4 per image, 2 images in the batch = 32)
- hence APoZ(channel 0) = (5 + 6) / 32 = 11/32
- then do the same for the other channels
- then prune the channel with the largest APoZ ⇒ in this case channel 2! (a small sketch follows)
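A small numpy sketch of the APoZ computation walked through above; the activations here are random stand-ins, and the 5 + 6 zeros over 32 elements from the worked example appear only in the comment.

```python
import numpy as np

def apoz(activations):
    """Average Percentage of Zeros per channel.
    activations: post-ReLU tensor of shape (batch, channels, H, W)."""
    zeros = (activations == 0)
    return zeros.mean(axis=(0, 2, 3))     # average over batch and spatial dims

# In the worked example above: channel 0 has 5 + 6 zeros over two 4x4 maps,
# so APoZ(channel 0) = 11 / 32.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(2, 3, 4, 4)), 0.0)   # 2 images, 3 channels, 4x4, ReLU'd
scores = apoz(acts)
to_prune = scores.argmax()     # prune the channel with the LARGEST APoZ (most zeros)
print(scores, to_prune)
```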
Regression-based Pruning (p.244)
- Minimize reconstruction error of the corresponding layer’s outputs (p.244)
- Channel Pruning for Accelerating Very Deep Neural Networks, He et al., ICCV 2017 (p.249)
- previous criteria consider the final loss of the entire network, from input to output
- here we examine one individual layer in isolation
- useful for LLMs, since we do not need full end-to-end training (forward + backward over the whole network) to reconstruct the weights
- so we work on one layer at a time instead
- prune along the input channel dimension, not the batch dimension
- there is an opportunity to prune in the batch dimension as well; see a later chapter
- how do we determine which channels to prune? (the red columns in the figure; see the reconstruction objective below)
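For reference, the layer-wise reconstruction objective from He et al. (ICCV 2017), with $Y$ the original layer output, $X_i$ the input slice contributed by channel $i$, $W_i$ the matching filter slice, $\beta_i$ a channel-selection coefficient (relaxed with a LASSO penalty in the paper), and $c'$ the number of channels kept:

$$
\arg\min_{\beta,\,W}\; \frac{1}{2N}\left\lVert Y - \sum_{i=1}^{c} \beta_i\, X_i W_i^{\top} \right\rVert_F^{2}
\quad \text{s.t.} \quad \lVert \beta \rVert_0 \le c'
$$

Channels with $\beta_i = 0$ (the red columns) are pruned, and the remaining weights $W$ are re-fit by least squares to minimize the reconstruction error.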