Pruning
Created: 16 Jan 2023, 03:58 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, TinyML
Overview
Related fields
Introduction
- Pruning or sparsification is a family of methods to produce sparse neural networks.
Structured Pruning vs Unstructured Pruning
Structured
- changes the structure of neural networks by physically removing grouped parameters (per DepGraph / Torch-Pruning paper)
- only the offsets of blocks or other elements arranged in a fixed structure need to be stored
- removes complete filters and directly leads to efficient deep neural models without requiring specialized hardware for sparse structures
Unstructured
- zeroes out a subset of individual weights without modifying the network structure (per DepGraph / Torch-Pruning paper)
- e.g. by setting weights to zero; see Rachit Singh - Deep learning model compression
- the offset of each individual element needs to be encoded and stored
- mask individual weights in the network
- requires specialized hardware for sparse structures to yield actual speedups (both styles are sketched below)
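As a concrete illustration, `torch.nn.utils.prune` supports both styles: `l1_unstructured` masks individual weights, while `ln_structured` zeroes whole filters along a chosen dimension. A minimal sketch (layer sizes and pruning amounts are arbitrary):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv_u = nn.Conv2d(16, 32, kernel_size=3)  # unstructured example
conv_s = nn.Conv2d(16, 32, kernel_size=3)  # structured example

# Unstructured: zero the 50% of individual weights with the smallest |w|.
# Tensor shape is unchanged; a binary "weight_mask" buffer records the zeros.
prune.l1_unstructured(conv_u, name="weight", amount=0.5)

# Structured: zero entire output filters (dim=0) with the smallest L2 norm.
# Once the zeroed filters are physically removed (what tools like Torch-Pruning
# automate), the layer is genuinely smaller and needs no sparse-matrix support.
prune.ln_structured(conv_s, name="weight", amount=0.25, n=2, dim=0)

# Make the masks permanent (folds mask * weight back into .weight)
prune.remove(conv_u, "weight")
prune.remove(conv_s, "weight")
```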
Model vs Ephemeral Pruning
- Model
- sparsification or pruning permanently applied to the model
- Ephemeral
- sparsification or pruning temporarily applied during computation
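A minimal sketch of the distinction; the names and the dropout-style random masking are illustrative, not from the source:

```python
import torch

w = torch.randn(64, 64)

# Model pruning: a mask is computed once and the zeros persist in the stored weights.
model_mask = (w.abs() > w.abs().median()).float()
w = w * model_mask                      # permanently sparse parameters

# Ephemeral pruning: a fresh mask is drawn for each forward pass (dropout-style),
# so the sparsity exists only during that computation; the dense weights survive.
def forward(x: torch.Tensor, w: torch.Tensor, drop_p: float = 0.5) -> torch.Tensor:
    ephemeral_mask = (torch.rand_like(w) > drop_p).float()
    return x @ (w * ephemeral_mask).T
```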

From Pruning neural networks without any data by iteratively conserving synaptic flow:
Pruning after training
- Conventional pruning algorithms assign scores to parameters in neural networks after training and remove the parameters with the lowest scores [5, 23, 24].
- Popular scoring metrics include weight magnitudes [4, 6], its generalization to multi-layers [25], first- [1, 26, 27, 28] and second-order [2, 3, 28] Taylor coefficients of the training loss with respect to the parameters, and more sophisticated variants [29, 30, 31].
- While these pruning algorithms can indeed compress neural networks at test time, there is no reduction in the cost of training.
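A minimal sketch of the simplest post-training scoring rule mentioned above (weight magnitude), using `torch.nn.utils.prune`; the 80% sparsity level is arbitrary:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, amount: float = 0.8) -> nn.Module:
    """After training, score every weight by |w| across all Linear/Conv2d layers
    and zero the globally lowest-scoring fraction `amount`."""
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (nn.Linear, nn.Conv2d))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
    for m, name in params:          # bake the masks into the weights
        prune.remove(m, name)
    return model
```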
Pruning before training
- Recent works demonstrated that randomly initialized neural networks can be pruned before training with little or no loss in the final test accuracy [10, 13, 32].
- In particular, the Iterative Magnitude Pruning (IMP) algorithm [10, 11] repeats multiple cycles of training, pruning, and weight rewinding to identify extremely sparse neural networks at initialization that can be trained to match the test accuracy of the original network. While IMP is powerful, it requires multiple cycles of expensive training and pruning with very specific sets of hyperparameters.
- Avoiding these difficulties, a different approach uses the gradients of the training loss at initialization to prune the network in a single-shot [13, 14].
- While these single-shot pruning algorithms at initialization are much more efficient, and work as well as IMP at moderate levels of sparsity, they suffer from layer-collapse, or the premature pruning of an entire layer rendering a network untrainable [33, 34].
- Understanding and circumventing this layer-collapse issue is the fundamental motivation for our study.
[10] Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations (ICLR), 2019.
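A rough sketch of the IMP loop described above (rewind-to-init variant); `train_fn` is an assumed placeholder for one full training run, and the hyperparameters are illustrative:

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model: nn.Module, train_fn, rounds: int = 5,
                                prune_frac: float = 0.2) -> nn.Module:
    init_state = copy.deepcopy(model.state_dict())      # weights at initialization
    for _ in range(rounds):
        train_fn(model)                                  # 1. train to convergence
        for m in model.modules():                        # 2. prune 20% of remaining weights per layer
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                prune.l1_unstructured(m, name="weight", amount=prune_frac)
        for name, p in model.named_parameters():         # 3. rewind surviving weights to their init values
            key = name.replace("_orig", "")              # pruned tensors are renamed "<name>_orig"
            if key in init_state:
                p.data.copy_(init_state[key])
    return model   # the "weight_mask" buffers define the sparse "winning ticket"
```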
Theoretical References
Papers
- Sparsity in Deep Learning - Pruning and Growth for Efficient Inference and Training in Neural Networks - 2021 survey paper
- Wanda - A Simple and Effective Pruning Approach for Large Language Models
- the approach prunes weights with the smallest magnitudes multiplied by the corresponding input activation norms, on a per-output basis (see the sketch after this list)
- requires no retraining or weight update, and the pruned LLM can be used as is
- outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates
- To read!
- Rethinking the Value of Network Pruning
- shows that for structured pruning, training the pruned model from scratch almost always achieves accuracy comparable to or higher than the model obtained by the typical “training, pruning and fine-tuning” procedure (Fig. 1 of the paper)
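A minimal sketch of the Wanda scoring rule for one linear layer, referenced from the Wanda entry above: each weight is scored by |W_ij| times the L2 norm of input feature j over a small calibration set, and the lowest-scoring weights are zeroed per output row, with no retraining. The function name is mine; the official implementation (locuslab/wanda on GitHub) differs in detail.

```python
import torch

def wanda_prune_linear(W: torch.Tensor, X: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """W: (out_features, in_features) weight of a linear layer.
    X: (n_samples, in_features) calibration activations feeding this layer.
    Zeroes the lowest-scoring `sparsity` fraction of weights within each output row."""
    act_norm = X.norm(p=2, dim=0)               # per-input-feature L2 norm, shape (in_features,)
    scores = W.abs() * act_norm                 # Wanda metric: |W_ij| * ||X_j||_2
    k = int(W.shape[1] * sparsity)
    prune_idx = scores.topk(k, dim=1, largest=False).indices   # per-output comparison group
    return W.scatter(1, prune_idx, 0.0)
```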
Articles
- Rachit Singh - Deep learning model compression
- Large Transformer Model Inference Optimization | Lil’Log
- article on optimization including pruning
Courses
Code References
Methods
Tools, Frameworks
- GitHub - VainF/Torch-Pruning: [CVPR-2023] Towards Any Structural Pruning; LLMs / Diffusion / YOLOv8 / CNNs / Transformers
- this is from NUS, by a student of Xinchao Wang! (a minimal usage sketch follows after this list)
- https://openaccess.thecvf.com/content/CVPR2023/papers/Fang_DepGraph_Towards_Any_Structural_Pruning_CVPR_2023_paper.pdf
- GitHub - JJGO/shrinkbench: PyTorch library to facilitate development and standardized evaluation of neural network pruning methods.
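A rough usage sketch of Torch-Pruning's dependency-graph API, adapted from its README; argument names may differ between versions, and the pruned channel indices are arbitrary:

```python
import torch
import torch_pruning as tp
from torchvision.models import resnet18

model = resnet18(weights=None)
example_inputs = torch.randn(1, 3, 224, 224)

# 1. Build the dependency graph by tracing the model with an example input
DG = tp.DependencyGraph().build_dependency(model, example_inputs=example_inputs)

# 2. Collect the group of coupled layers affected by removing output channels
#    2, 6 and 9 of conv1 (BN layers, downstream convs, etc. are picked up automatically)
group = DG.get_pruning_group(model.conv1, tp.prune_conv_out_channels, idxs=[2, 6, 9])

# 3. Physically remove those channels everywhere in the group
if DG.check_pruning_group(group):   # avoids pruning a layer down to zero channels
    group.prune()
```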