EfficientDL - 4. Pruning and Sparsity (Part II)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Pruning Ratio (p.261)
How should we find per-layer pruning ratios? (p.261)
- sensitivity of each layer!
- some layers: prune a bit and accuracy drops a lot
- other layers, e.g. fully connected / MLP layers: prune a lot and accuracy stays the same ⇒ prune these more!
Sensitivity Analysis
to determine the per-layer pruning ratio (p.265)
- in this example the green line (layer 1) shows the smallest accuracy dip even at a high pruning ratio of 90%
- so it seems like we can prune this layer aggressively!
- one catch: we are assuming the layers are independent of each other and do not interfere
- pruning layers independently is a tradeoff between GPU hours and how many (layer, ratio) points we can afford to evaluate (see the sketch after this list)
- does this mean that we should prune layer 1?
- not necessarily: we don't know the actual size of each layer here, so the green curve could be a small layer, and even pruning 80% of it would save very little
- we need to consider the actual size of each layer too ⇒ AMC / NetAdapt
- for newer LLMs, might be abit more uniform in each layer size vs CNNs
- we want a one-button solution that does the pruning automatically
- specify the target model size, and it returns the smaller model
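A minimal sketch of the per-layer sensitivity scan described above, assuming a PyTorch model and a hypothetical `evaluate(model)` helper that returns validation accuracy; each layer is pruned in isolation and then restored, matching the independence assumption:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def sensitivity_scan(model, evaluate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Prune each conv/linear layer in isolation at several ratios, record the
    accuracy, then restore the original weights before moving on."""
    results = {}
    for name, module in model.named_modules():
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue
        original = module.weight.detach().clone()
        curve = []
        for r in ratios:
            k = max(1, int(r * original.numel()))
            threshold = torch.kthvalue(original.abs().flatten(), k).values
            mask = (original.abs() > threshold).float()
            module.weight.copy_(original * mask)   # prune only this layer
            curve.append((r, evaluate(model)))     # all other layers stay dense
            module.weight.copy_(original)          # restore before the next point
        results[name] = curve
    return results
```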
AMC: AutoML for Model Compression (p.281)
(p.281) Pruning as a reinforcement learning problem
- For ResNet-50 (p.285): the RL agent automatically learns that 3x3 convolutions have more redundancy and can be pruned more
- (p.285) the RL agent automatically learns that 1x1 convolutions have less redundancy and should be pruned less
- latest 2023 result: 0.8 ms latency for ResNet!
- why does 0.75-MobileNet (channels pruned evenly across layers) give roughly a 70% speedup?
- FLOPs of a conv layer are the product of 6 terms: output height, output width, kernel height, kernel width, input channels, output channels ⇒ since both input and output channels are scaled to 0.75x, the relationship is quadratic (0.75² ≈ 0.56x FLOPs), as worked out below
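A quick worked check of that quadratic relationship (the 56x56 output with 3x3 kernels and 128 channels is just an assumed example shape):

```python
def conv_macs(h_out, w_out, k_h, k_w, c_in, c_out):
    # 6-term product: output height x width, kernel height x width, in/out channels
    return h_out * w_out * k_h * k_w * c_in * c_out

base   = conv_macs(56, 56, 3, 3, 128, 128)
scaled = conv_macs(56, 56, 3, 3, int(0.75 * 128), int(0.75 * 128))
print(scaled / base)  # 0.5625 = 0.75**2 -> roughly a 1.8x reduction in FLOPs
```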
NetAdapt (p.287)
(p.287) A rule-based iterative/progressive method
Fine-tuning / Training (p.296)
(p.296) How should we improve performance of sparse models?
(p.297)
- Fine-tuning the pruned neural networks will help recover the accuracy and push the pruning ratio higher.
- Learning rate for fine-tuning is usually 1/100 or 1/10 of the original learning rate.
Iterative Pruning (p.305)
- Prune ⇒ Finetune = 1 iteration
Iterative pruning gradually increases the target sparsity in each iteration (p.305), e.g. 30% → 50% → 70% (see the sketch after this list)
- better than single-step aggressive pruning
- boosts the pruning ratio from 5x to 9x on AlexNet compared to single-step aggressive pruning (p.305)
- requires high engineering effort, difficult to automate
- since sensitivity analysis, pruning, and fine-tuning have to be redone at each step
- for research / a single model it is easy to do as a one-off
- but if there are bypass layers (e.g. ResNet shortcuts)
- there will be additional dependencies ⇒ the two tensors feeding an elementwise add must have the same shape, so the pruned channels must match across branches
- so we need to trace these dependencies to ensure the pruned shapes match
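A sketch of the prune ⇒ fine-tune loop with a growing sparsity schedule; `finetune` is an assumed helper that trains for a few epochs while re-applying the masks after each optimizer step, and the learning-rate scaling follows the 1/10 to 1/100 rule above:

```python
import torch

def magnitude_prune_(model, sparsity):
    """Zero out the smallest-magnitude entries in each weight matrix; return the masks."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                        # skip biases / norm parameters
            continue
        k = max(1, int(sparsity * p.numel()))
        threshold = torch.kthvalue(p.detach().abs().flatten(), k).values
        masks[name] = (p.detach().abs() > threshold).float()
        p.data.mul_(masks[name])
    return masks

def iterative_prune(model, finetune, schedule=(0.3, 0.5, 0.7), base_lr=0.1):
    # one iteration = prune -> fine-tune; the target sparsity grows each iteration
    for sparsity in schedule:
        masks = magnitude_prune_(model, sparsity)
        finetune(model, masks, lr=base_lr / 100)  # much smaller LR than original training
    return model
```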
Regularization (p.306)
- add a regularization term to the loss (see the sketch below) so as to: (p.306)
- penalize non-zero parameters
- encourage smaller parameters
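For example, an L1 penalty added to the task loss pushes weights toward zero before magnitude pruning (the `1e-5` coefficient is an arbitrary placeholder; an L2 penalty would use `p.pow(2).sum()` instead):

```python
def loss_with_l1(criterion, model, outputs, targets, l1_lambda=1e-5):
    """Task loss plus an L1 penalty on the weight matrices."""
    task_loss = criterion(outputs, targets)
    l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
    return task_loss + l1_lambda * l1_penalty
```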
The Lottery Ticket Hypothesis (p.308)
- currently, we have to train a model then prune it
- can we start with a sparse and pruned model and directly train the sparse model and beat the pruned model? (p.308)
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that when trained in isolation it can match the test accuracy of the original network after training for at most the same number of iterations. - The Lottery Ticket Hypothesis
see Iterative magnitude pruning

- train to convergence, then prune based on magnitude ⇒ this gives the sparsity pattern (see the sketch after this list)
- inherit the sparsity pattern, reset the surviving weights to their original initialization, and train to convergence again
- for small datasets like CIFAR and MNIST this still works, but for large datasets like ImageNet it is still challenging! ^78b8e6
- after only 3-4 epochs of training + pruning, the sparsity pattern can already be determined
- but you still need to go through the training
- active research area!
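A sketch of iterative magnitude pruning in the lottery-ticket style, where surviving weights are rewound to their original initialization each round; `train` is an assumed helper that trains the masked model to convergence:

```python
import copy
import torch

def find_lottery_ticket(model, train, prune_step=0.2, rounds=3):
    init_state = copy.deepcopy(model.state_dict())       # keep the original init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train(model, masks)                               # train to convergence
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p.detach().abs()[masks[name].bool()]
            k = max(1, int(prune_step * alive.numel()))
            threshold = torch.kthvalue(alive, k).values   # prune 20% of the survivors
            masks[name] *= (p.detach().abs() > threshold).float()
        model.load_state_dict(init_state)                 # rewind to the original init
        for name, p in model.named_parameters():          # keep only surviving weights
            if name in masks:
                p.data.mul_(masks[name])
    return model, masks
```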
System Support for Sparsity (p.311)
- use coarse-grained pruning at the filter/channel level ⇒ removes entire rows from the weight matrix
- can still use dense matrix multiplication libraries to accelerate
- but the limit is a lower achievable pruning ratio ⇒ what hardware support exists for more fine-grained pruning?
EIE: Weight Sparsity + Activation Sparsity for GEMM (p.313)
- all 3 can be stacked together ⇒ the speedup multiplies across all 3
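A toy model of the computation EIE accelerates, with weights stored in a CSC-like format (only non-zeros kept) and columns skipped whenever the corresponding activation is zero; the encoding and variable names here are illustrative, not EIE's actual format:

```python
import numpy as np

def eie_style_spmv(values, row_idx, col_ptr, activations, out_dim):
    """y = W @ a, exploiting weight sparsity (only non-zeros stored per column)
    and activation sparsity (zero-activation columns skipped entirely)."""
    y = np.zeros(out_dim)
    for j, a in enumerate(activations):
        if a == 0.0:                          # activation sparsity: skip column
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * a    # weight sparsity: stored non-zeros only
    return y

# tiny example: W = [[0,2,0,0],[1,0,0,3],[0,0,0,4]] with 4 non-zeros
values, row_idx = np.array([1., 2., 3., 4.]), np.array([1, 0, 1, 2])
col_ptr = np.array([0, 1, 2, 2, 4])
a = np.array([5., 0., 7., 1.])                # the zero activation is skipped
print(eie_style_spmv(values, row_idx, col_ptr, a, out_dim=3))  # [0. 8. 4.]
```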
Retrospective on EIE: a 2023 retrospective on the original 2016 paper
- Pros
- Cons
- not as easily applied to arrays of vector processors ⇒ improvement: structured sparsity (N:M sparsity) (p.336), as in newer NVIDIA GPUs
- only supports FC layers ⇒ actually reborn in LLMs (p.336)
- fits everything in SRAM ⇒ practical for TinyML, not for LLMs (p.336)
- the TPU was designed with 28 MB of SRAM
- because the model weights were ~28M parameters
- commercial accelerators commonly rely on SRAM to hold all the weights
(p.337)
- Generative AI: spatial sparsity (SIGE, NeurIPS’22)
- Transformer: token sparsity, progressive quantization (SpAtten, HPCA’21)
- Video: temporal sparsity (TSM, ICCV’19)
- Point cloud: spatial sparsity (TorchSparse, MLSys’22 & PointAcc, Micro’22)
(p.337) future AI models will be sparse at various granularity and structures
- MoE models (GPT-4): a giant model where each input only activates a portion of the model, not all of it
(p.337) Codesigned with specialized accelerators, sparse models will become more efficient and accessible.
NVIDIA Tensor Core: M:N Weight Sparsity (p.338)
2:4 weight sparsity / M:N sparsity
- at most 2 non-zero values in every group of 4 elements
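A minimal sketch of projecting a weight matrix onto the 2:4 pattern by keeping the two largest-magnitude values in every group of four along the input dimension (real deployments use NVIDIA's library tooling and fine-tune afterwards):

```python
import torch

def enforce_2_to_4(weight):
    """Keep the 2 largest-magnitude entries in every group of 4 consecutive
    elements along the last dimension; zero out the rest."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0
    groups = weight.reshape(out_features, in_features // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices             # 2 survivors per group
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_features, in_features)

w_24 = enforce_2_to_4(torch.randn(8, 16))   # each group of 4 now has <= 2 non-zeros
```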
TorchSparse & PointAcc: Activation Sparsity for Sparse Convolution (p.344)
TorchSparse: Sparse Convolution Library (p.345)
- after a regular convolution, even though the input feature map was sparse, the output becomes dense!
- how can we keep the output sparse, with the same sparsity pattern as the input? (see the sketch after this list)

- still pad some work with zeros to allow for parallelism, since fully irregular computation is difficult to parallelize across the GPU
- Separate computation (baseline) : many kernel calls, low device utilization (p.362)
- Dense convolution: best regularity but large computation overhead (p.363)
- Computation with grouping: balancing overhead and regularity (p.364)
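A toy reference (not the TorchSparse implementation) of a sparse convolution that only produces outputs at the input's active coordinates, so the output keeps the input's sparsity pattern; `weights` is assumed to map each kernel offset to its own weight matrix:

```python
import torch

def sparse_conv_same_sites(coords, feats, weights, kernel_offsets):
    """coords: (N, 2) integer active coordinates; feats: (N, C_in) features;
    weights: dict {offset: (C_in, C_out) matrix}; outputs only at input sites."""
    index = {tuple(c.tolist()): i for i, c in enumerate(coords)}
    c_out = next(iter(weights.values())).shape[1]
    out = torch.zeros(feats.shape[0], c_out)
    for offset in kernel_offsets:
        # gather (input neighbor, output site) pairs for this offset,
        # then do one dense matmul and scatter-add the results
        pairs = [(index[(x + offset[0], y + offset[1])], i)
                 for i, (x, y) in enumerate(coords.tolist())
                 if (x + offset[0], y + offset[1]) in index]
        if not pairs:
            continue
        src, dst = map(list, zip(*pairs))
        out[dst] += feats[src] @ weights[offset]
    return out
```

Grouping these per-offset matmuls into similarly sized batches is the "computation with grouping" strategy above, trading regularity against padding overhead.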
PointAcc: Hardware Accelerator for Sparse Convolution (p.370)
- specific hardware for point cloud acceleration
Current Challenges
- LLM pruning still struggles to deliver real speedups, partly due to the difficulty of fine-tuning at that scale
- iterative magnitude pruning for large datasets like imagenet ^78b8e6