Sparsity in Deep Learning - Pruning and Growth for Efficient Inference and Training in Neural Networks
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags:
Annotations
background on mathematical methods in sparsification, * show annotation
early structure adaptation * show annotation
relations between sparsity and the training process * show annotation
techniques for achieving acceleration on real hardware * show annotation
over-parameterized models are easier to train with stochastic gradient descent (SGD) than more compact representations * show annotation
Deep learning models are traditionally dense and over-parameterized, sometimes to the extent that they can memorize random patterns in data * show annotation
over-parameterization comes at the cost of additional memory and computation effort during model training and inference * show annotation
found that sparsity can improve robustness against adversarial attacks * show annotation
investigate sparsity during the training process to manage the costs of training * show annotation
which elements of a neural network are sparsified, when are they sparsified, and how can they be sparsified * show annotation
consider sparse training and the need to re-add connections during training to maintain a constant model complexity after sparsification * show annotation
nearly every basic approach has been invented at least twice * show annotation
six main techniques * show annotation
Down-sizing models creates smaller dense networks to solve the same task * show annotation
Operator factorization decomposes operators, for example the matrix multiplication of dense layers, into smaller operators. * show annotation
Value quantization seeks to find a good low-precision encoding for values in the networks, such as weights, activations, or gradients. * show annotation
Value compression can be used to compress model structures and values (e.g., weights) * show annotation
Parameter sharing can lead to model compression by exploiting redundancy in the parameter space * show annotation
Sparsification can lead to more efficient models that continue to operate in high-dimensional feature spaces but reduce the representational complexity using only a subset of the dimensions at a time. * show annotation
parameter sharing, can also reduce the computational complexity * show annotation
In this paper, we focus on the most complex and, in our view, most powerful of those techniques: sparsification, also known as “pruning” in some contexts. * show annotation
Model Compression Techniques * show annotation
quick executive overview of the field, then we recommend studying Sections 2 and 8 while skimming Sections 3, 4, 5, and 7, especially the overview figures and tables therein. * show annotation
sparsification best practices, we recommend the executive overview in combination with details in Section 6 and the references therein. * show annotation
utility of sparsification * show annotation
(1) improved generalization and robustness and (2) improved performance for inference and/or training * show annotation
differentiate between model (also structural) and ephemeral sparsification * show annotation
hardware engineering aspects, then we recommend to at least get the executive overview mentioned before and study Section 7 in detail * show annotation
Model sparsification changes the model but does not change the sparsity pattern across multiple inference or forward passes. * show annotation
If we sparsify arbitrary weights, the resulting model may be unstructured and we may need to remember indices as described before. This adds overheads for index structures and leads to less efficient execution on hardware that is optimized for dense computations * show annotation
Ephemeral sparsification is a second class of sparsification approaches—it is applied during the calculation of each example individually and only relevant for this example. * show annotation
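A minimal sketch of what "per-example" means here: ReLU already zeroes a large share of activations for each input, and an explicit per-example top-k mask is another simple instance. The function name, tensor shapes, and k below are made up for illustration.

```python
# Minimal sketch of ephemeral (per-example) sparsity: the mask depends on the
# current input and is discarded after the pass; the model itself is unchanged.
import torch

def topk_activations(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude activations per example, zero the rest."""
    _, idx = torch.topk(x.abs(), k, dim=1)
    mask = torch.zeros_like(x).scatter_(1, idx, 1.0)   # per-example binary mask
    return x * mask

x = torch.relu(torch.randn(4, 16))     # ReLU alone already yields ~50% zeros
y = topk_activations(x, k=4)           # explicit ephemeral top-k sparsification
print((y != 0).float().mean().item())  # surviving fraction, at most 4/16
```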
While ephemeral sparsity is dynamically updated for each example and configured with a small number of parameters during inference and training, model sparsity follows a more complex NAS-like procedure. Model sparsity is thus often trained with a schedule. We differentiate three different classes of training schedules illustrated in Fig. 7. * show annotation
2.4.1 Sparsify after training. The train-then-sparsify * show annotation
2.4.2 Sparsify during training. The sparsify-during-training * show annotation
2.4.3 Sparse training. The fully-sparse training * show annotation
since we are starting from a dense model, training does not change, such that existing hyperparameter settings and learning schedules can be re-used * show annotation
this approach needs to hold the dense model in memory at the beginning of the operation and thus does not enable the use of smaller-capacity devices * show annotation
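For the sparsify-during-training class, a common ingredient is a sparsity schedule that ramps the pruned fraction up gradually over training. The sketch below follows a cubic ramp in the spirit of Zhu and Gupta [2017] (cited again later in these notes); the step counts and target sparsity are purely illustrative.

```python
# Hedged sketch of a gradual sparsity schedule (cubic ramp), in the spirit of
# Zhu and Gupta [2017]; begin/end steps and the final sparsity are illustrative.
def sparsity_at_step(step: int, s_initial: float = 0.0, s_final: float = 0.9,
                     begin_step: int = 0, end_step: int = 10_000) -> float:
    """Current target sparsity: ramps from s_initial to s_final cubically."""
    if step < begin_step:
        return s_initial
    if step >= end_step:
        return s_final
    progress = (step - begin_step) / (end_step - begin_step)
    return s_final + (s_initial - s_final) * (1.0 - progress) ** 3

# During training one would, every few hundred steps, magnitude-prune weights
# until the current target sparsity is met, then keep training the survivors.
for step in (0, 2_500, 5_000, 10_000):
    print(step, round(sparsity_at_step(step), 3))
```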
- Instead of deleting pruned weights and gradients, they use binary masks to determine the presence or absence of weights and update even masked weights during backpropagation to enable better weight regrowth/selection * show annotation
Does this mean the normal way is to delete? Does "delete" mean really removing the weight, or just setting it to zero?
Vs. masking? (See the sketch below.)
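To make the question concrete: in most frameworks "pruning" keeps the dense tensor and zeroes entries through a binary mask, while truly deleting means storing only the surviving values plus their indices in a compact format. The tensors and threshold below are invented for the sketch, not taken from the paper.

```python
# Sketch of the two options, assuming PyTorch-style dense tensors.
import torch

w = torch.randn(4, 4)
mask = (w.abs() > 0.5).float()      # binary mask: 1 = keep, 0 = pruned

# (a) Masking: the tensor keeps its dense shape and pruned entries are zeroed
#     in the forward pass; if gradients are still applied to the full tensor
#     (as in the annotated scheme), masked weights can later "regrow".
w_effective = w * mask

# (b) Deleting: pruned entries are physically dropped and only surviving
#     values plus their indices are stored; this saves memory but regrowth
#     needs extra bookkeeping because the dropped values are gone.
kept_indices = mask.nonzero()       # (row, col) pairs of survivors
kept_values = w[mask.bool()]        # 1-D tensor of surviving weights
```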
- Iterative hard thresholding (IHT) is a technique where training schedules of dense and sparse iterations are combined * show annotation
Different from iterative magnitude pruning (lottery ticket hypothesis)
From The lottery ticket hypothesis - Finding sparse, trainable neural networks under “After Training”: “Han et al. (2017) and Jin et al. (2016) restore pruned connections to increase network capacity after small weights have been pruned and surviving weights fine-tuned.”
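A rough sketch of the dense/sparse alternation that IHT-style schedules describe; `model`, `train_epoch`, the per-layer budget `k`, and the phase lengths are all placeholders, not values from the survey.

```python
# Hedged sketch of an IHT-style schedule: alternate dense training phases with
# phases where only the top-k weights (by magnitude) of each layer are active.
import torch

def topk_mask(weight: torch.Tensor, k: int) -> torch.Tensor:
    """Binary mask keeping the k largest-magnitude entries of `weight`."""
    cutoff = torch.topk(weight.abs().flatten(), k).values.min()
    return (weight.abs() >= cutoff).float()

def iht_schedule(model, train_epoch, k: int, cycles: int = 3,
                 dense_epochs: int = 2, sparse_epochs: int = 2):
    for _ in range(cycles):
        for _ in range(dense_epochs):        # dense phase: all weights train
            train_epoch(model, masks=None)
        masks = {name: topk_mask(p.detach(), k)
                 for name, p in model.named_parameters()
                 if p.dim() > 1}             # weight matrices only, not biases
        for _ in range(sparse_epochs):       # sparse phase: masked training
            train_epoch(model, masks=masks)
```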
Han et al. [2017] use a similar scheme where they run three steps during training: (1) (traditional) dense training to convergence, (2) magnitude-pruning followed by retraining, and (3) dense training * show annotation
DSD - Dense Sparse Dense
magnitude-based pruning (see Section 3.2) * show annotation
starts with a sparse model and trains in the sparse regime by removing and adding elements during the training process. * show annotation
differentiate between static and dynamic sparsity during sparse training. * show annotation
Dynamic sparsity during training. * show annotation
schemes that iteratively prune and add (regrow) elements during the training phase * show annotation
Fixed sparsity during training * show annotation
trained with a fixed sparsity structure determined before training starts. * show annotation
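For the dynamic case, a single prune-and-regrow step might look roughly like the sketch below, loosely in the spirit of SET-style methods; the random regrowth criterion and the 10% fraction are illustrative choices, not the survey's prescription.

```python
# Hedged sketch of one prune-and-regrow step for dynamic sparse training.
import torch

def prune_and_regrow(weight: torch.Tensor, mask: torch.Tensor,
                     fraction: float = 0.1) -> torch.Tensor:
    """Drop the smallest active weights and reactivate as many inactive positions."""
    active = mask.bool()
    n_change = max(1, int(fraction * active.sum().item()))

    # Prune: deactivate the n_change active weights with the smallest magnitude.
    magnitudes = weight.abs().masked_fill(~active, float("inf"))
    drop = torch.topk(magnitudes.flatten(), n_change, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop] = 0.0

    # Regrow: reactivate n_change currently inactive positions (randomly here;
    # gradient-based criteria are another common choice).
    inactive = (new_mask == 0).nonzero().flatten()
    grow = inactive[torch.randperm(inactive.numel())[:n_change]]
    new_mask[grow] = 1.0
    return new_mask.view_as(mask)   # overall sparsity stays constant
```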
2.4.4 Ephemeral sparsity during training * show annotation
efficient training methods would take advantage of both ephemeral and model sparsity during training * show annotation
2.4.6 General Sparse Deep Learning Schedules * show annotation
use-case for sparsification is to enable ensemble models with a limited parameter and compute budget * show annotation
- Structured sparsity constrains sparsity patterns in the weights such that they can be described with low-overhead representations such as strides or blocks. * show annotation
i.e., the pruned parameter / node / structure can be removed outright
- unstructured weight sparsity requires storing the offsets of non-zero elements and handling the structure explicitly during processing. * show annotation
i.e., a mask selects the non-zero elements (the zeros remain in the matrix)
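A small NumPy illustration of the storage difference the two notes describe; the matrix size, threshold, and column-pruning criterion are arbitrary.

```python
# Unstructured vs. structured sparsity storage, illustrated with NumPy.
import numpy as np

w = np.random.randn(8, 8)

# Unstructured: every surviving element needs its own (row, col) offset.
unstructured_mask = np.abs(w) > 1.0
rows, cols = np.nonzero(unstructured_mask)        # one index pair per element

# Structured (here: pruning whole columns): only the kept column ids are
# stored, and what remains is a smaller *dense* matrix that runs on the
# dense kernels today's hardware is optimized for.
col_scores = np.abs(w).sum(axis=0)
kept_cols = np.sort(np.argsort(col_scores)[-4:])  # keep 4 of 8 columns
w_structured = w[:, kept_cols]                    # dense 8x4 block
```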
Magnitude pruning * show annotation
removing weights with the smallest absolute magnitude * show annotation
how to choose the magnitude 𝑥 below which to prune. * show annotation
fixing a weight budget and keeping the top-𝑘 weights globally or per layer, one could learn sparsification thresholds per layer * show annotation
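A minimal sketch of the "fixed weight budget, global top-k" variant of magnitude pruning; the toy model and the budget k=100 are invented for the example.

```python
# Hedged sketch of global magnitude pruning under a fixed weight budget: keep
# the k largest-magnitude weights across *all* layers rather than setting a
# threshold per layer.
import torch

def global_magnitude_masks(model: torch.nn.Module, k: int):
    weights = [p for p in model.parameters() if p.dim() > 1]   # skip biases
    all_mags = torch.cat([p.detach().abs().flatten() for p in weights])
    cutoff = torch.topk(all_mags, k).values.min()              # global cutoff
    return [(p.detach().abs() >= cutoff).float() for p in weights]

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 4))
masks = global_magnitude_masks(model, k=100)
print([round(m.mean().item(), 3) for m in masks])   # per-layer density differs
```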
Han et al. [2016b] popularized magnitude pruning for modern deep neural networks as part of neural network compression for inference. * show annotation
In unstructured pruning, the popular paper on model compression by Han et al. [2016b] combines magnitude-based sparsification, quantization, weight sharing, and Huffman coding into a compression scheme able to reduce AlexNet and VGG-16 on the ImageNet dataset by 35× and 49×, respectively, without loss of accuracy. * show annotation
training a sparse model is more prone to converge to suboptimal local minima than a dense network * show annotation
Sparse networks do not always execute faster than dense networks using current machine learning frameworks on today’s hardware. * show annotation
- scientific computing kernels such as the sparse BLAS or cuSPARSE are only optimized for scientific computing workloads and support formats aimed at high sparsities such as compressed sparse row * show annotation
whereas DL models are normally not that highly sparse
completely unstructured storage where the offset for each single element needs to be encoded * show annotation
structured storage formats that only store offsets of blocks or other elements arranged with a fixed structure. * show annotation
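For reference, the compressed sparse row (CSR) format mentioned above stores a values array, a column-index array, and row pointers; at the moderate sparsities typical in deep learning, the index arrays can rival the values in size. The tiny matrix below is made up.

```python
# CSR storage of a small matrix, using SciPy to show the three index arrays.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0., 2., 0., 0.],
                  [1., 0., 0., 3.],
                  [0., 0., 0., 0.]])
sparse = csr_matrix(dense)

print(sparse.data)     # non-zero values:           [2. 1. 3.]
print(sparse.indices)  # column offset per value:   [1 0 3]
print(sparse.indptr)   # where each row starts:     [0 1 3 3]
```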
why can networks be pruned and what is the best pruning methodology remain as open questions * show annotation
pruning is most efficient for architectures that are overparameterized * show annotation
should always consider the degree of over-parameterization or what we call the “parameter efficiency” * show annotation
- “Lottery Ticket Hypothesis” * show annotation
The lottery ticket hypothesis - Finding sparse, trainable neural networks
“Iterative Magnitude Pruning” * show annotation
- identified by Blalock et al. [2020] who propose a standard methodology together with a set of benchmarks to solve this issue * show annotation
toy examples, the MNIST dataset with the LeNet-300-100 and LeNet-5 networks can act as a good calibration * show annotation
state of the art is above 98% accuracy with less than 1% of the original parameters * show annotation
global magnitude pruning is a good baseline method for a wide range of scenarios, see e.g., Singh and Alistarh [2020] for results. * show annotation
breakthrough results in the area of efficient convolutional neural networks can be seen as manually defined sparsifiers, such as bottleneck layers or depthwise separable convolutions * show annotation
Newer works on transformers suggest the more automated way of “train big and then prune” * show annotation
unstructured iterative magnitude pruning of Zhu and Gupta [2017] on CNNs for image classification results in a large degradation in accuracy for a small number of classes in tasks such as ImageNet, compared to the model’s overall decrease. * show annotation
quantization results in a much smaller impact to different classes * show annotation
pruned models are significantly more brittle under distribution shifts, such as corrupted images * show annotation
Increased errors on certain classes caused by pruning can amplify existing algorithmic biases. * show annotation
pruning increaseserrors on underrepresented subgroups * show annotation
Therefore, it is important to study the finer-grained impacts of pruning, rather than just the overall accuracy * show annotation
flurry of simple approaches enables reaching moderate sparsity levels (e.g., 50–90%) at the same or even increased accuracy * show annotation
reaching higher sparsity levels (e.g., >95%) requires more elaborate pruning techniques where we may be reaching the limit of gradient-based optimization techniques for learning. * show annotation
Structured pruning seems to provide a great tradeoff between accuracy and performance on today’s architectures * show annotation