Sparsity in Deep Learning - Pruning and Growth for Efficient Inference and Training in Neural Networks


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags:

Annotations


  • over-parameterized models are easier to train with stochastic gradient descent (SGD) than more compact representations * show annotation

  • Deep learning models are traditionally dense and over-parameterized, sometimes to the extent that they can memorize random patterns in data * show annotation

  • over-parameterization comes at the cost of additional memory and computation effort during model training and inference * show annotation

  • found that sparsity can improve robustness against adversarial attacks * show annotation

  • investigate sparsity during the training process to manage the costs of training * show annotation

  • which elements of a neural network are sparsified, when are they sparsified, and how can they be sparsified * show annotation

  • consider sparse training and the need to re-add connections during training to maintain a constant model complexity after sparsification * show annotation

  • nearly every basic approach has been invented at least twice * show annotation

  • Down-sizing models creates smaller dense networks to solve the same task * show annotation

  • Operator factorization decomposes operators, for example the matrix multiplication of dense layers, into smaller operators. * show annotation
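
A minimal numpy sketch of this idea (my own illustration, not from the paper): replace one dense matmul with two thin ones via a truncated SVD. The rank of 32 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))      # one dense layer's weight matrix

# Replace the 512x512 matmul (~262k parameters) with two thin matmuls,
# 512x32 and 32x512 (~33k parameters), via a truncated SVD.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :32] * S[:32]                   # 512 x 32
B = Vt[:32]                              # 32 x 512

x = rng.standard_normal(512)
y_approx = A @ (B @ x)                   # (here crude) low-rank approximation of W @ x, with fewer FLOPs
```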

  • Value quantization seeks to find a good low-precision encoding for values in the networks, such as weights, activations, or gradients. * show annotation

  • Value Compression can be used to compress model structures and values (e.g., weights) * show annotation

  • Parameter sharing can lead to model compression by exploiting redundancy in the parameter space * show annotation

  • Sparsification can lead to more efficient models that continue to operate in high-dimensional feature spaces but reduce the representational complexity using only a subset of the dimensions at a time. * show annotation

  • parameter sharing, can also reduce the computational complexity * show annotation

  • In this paper, we focus on the most complex and, in our view, most powerful of those techniques: sparsification, also known as “pruning” in some contexts. * show annotation

  • quick executive overview of the field, then we recommend studying Sections 2 and 8 while skimming Sections 3, 4, 5, and 7, especially the overview figures and tables therein. * show annotation

  • sparsification best practices, we recommend the executive overview in combination with details in Section 6 and the references therein. * show annotation

  • (1) improved generalization and robustness and (2) improved performance for inference and/or training * show annotation

  • differentiate between model (also structural) and ephemeral sparsification * show annotation

  • hardware engineering aspects, then we recommend to at least get the executive overview mentioned before and study Section 7 in detail * show annotation

  • Model sparsification changes the model but does not change the sparsity pattern across multiple inference or forward passes. * show annotation

  • If we sparsify arbitrary weights, the resulting model may be unstructured and we may need to remember indices as described before. This adds overheads for index structures and leads to less efficient execution on hardware that is optimized for dense computations * show annotation

  • Ephemeral sparsification is a second class of sparsification approaches—it is applied during the calculation of each example individually and only relevant for this example. * show annotation

  • ephemeral sparsity is dynamically updated for each example and configured with a small number of parameters during inference and training, model sparsity follows a more complex NAS-like procedure. Model sparsity is thus often trained with a schedule. We differentiate three different classes of training schedules illustrated in Fig. 7. * show annotation
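
A small numpy sketch of the distinction (my own example, not from the paper): a pruning mask on the weights stays fixed across inputs (model sparsity), while ReLU zeroes a different subset of activations for every input (ephemeral sparsity).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))

# Model (structural) sparsity: one fixed mask, reused for every example.
weight_mask = np.abs(W) > 0.8
W_pruned = W * weight_mask

for _ in range(3):
    x = rng.standard_normal(8)
    act = np.maximum(W_pruned @ x, 0.0)          # ReLU
    # Ephemeral sparsity: which activations are zero changes per example.
    print("zeroed activations:", np.flatnonzero(act == 0.0))
```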

  • 2.4.2 Sparsify during training. The sparsify-during-training * show annotation

  • since we are starting from a dense model, training does not change such that existing hyperparameter settings and learning schedules can be re-used * show annotation

  • this approach needs to hold the dense model in memory at the beginning of the operation and thus does not enable the use of smaller-capacity devices * show annotation

  • Instead of deleting pruned weights and gradients, they use binary masks to determine the presence or absence of weights and update even masked weights during backpropagation to enable better weight regrowth/selection * show annotation

Does this mean the normal way is to delete? Does "delete" mean really removing the weight, or just setting it to zero?

Vs masking?
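
A numpy sketch of my reading of the two options (not from the paper): "masking" keeps the tensor shape and just zeroes pruned entries, so they can still be updated and regrown, whereas truly deleting keeps only surviving values plus their indices.

```python
import numpy as np

W = np.array([[0.9, -0.05, 0.0, 1.2],
              [0.02, -0.7, 0.3, -0.01]])

# (a) Masking: pruned weights are set to zero but stay in the tensor,
#     so backpropagation can still touch them and connections can regrow.
mask = np.abs(W) >= 0.1
W_masked = W * mask                       # same shape, zeros where pruned

# (b) True deletion: keep only surviving values and their indices; this
#     saves memory but needs explicit index structures downstream.
rows, cols = np.nonzero(mask)
values = W[rows, cols]
print(W_masked.shape, values.shape)       # (2, 4) vs (4,)
```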

  • Iterative hard thresholding (IHT) is a technique where training schedules of dense and sparse iterations are combined * show annotation

Different from iterative magnitude pruning (lottery ticket hypothesis)

From The lottery ticket hypothesis - Finding sparse, trainable neural networks under “After Training”: “Han et al. (2017) and Jin et al. (2016) restore pruned connections to increase network capacity after small weights have been pruned and surviving weights fine-tuned.”
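
A sketch of just the hard-thresholding step (my own code; the actual IHT schedules in the cited works differ in details): keep the top-k weights by magnitude and zero the rest, and alternate this with dense training phases.

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w, zero out the rest
    (ties at the cutoff may keep slightly more than k)."""
    mags = np.abs(w).ravel()
    if k >= mags.size:
        return w.copy()
    cutoff = np.partition(mags, -k)[-k]      # k-th largest magnitude
    return np.where(np.abs(w) >= cutoff, w, 0.0)

w = np.random.default_rng(0).standard_normal((4, 4))
print(hard_threshold(w, 4))                  # only 4 surviving weights
```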

  • Han et al. [2017] use a similar scheme where they run three steps during training: (1) (traditional) dense training to convergence, (2) magnitude-pruning followed by retraining, and (3) dense training * show annotation

DSD - Dense Sparse Dense

From The lottery ticket hypothesis - Finding sparse, trainable neural networks under “After Training”: “Han et al. (2017) and Jin et al. (2016) restore pruned connections to increase network capacity after small weights have been pruned and surviving weights fine-tuned.”
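
A toy PyTorch sketch of that three-step dense-sparse-dense schedule, on made-up data and a single linear layer; the ~50% pruning ratio and step counts are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randn(256, 1)    # dummy data
model = nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train(steps, mask=None):
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), y).backward()
        opt.step()
        if mask is not None:                 # keep pruned weights at zero
            with torch.no_grad():
                model.weight.mul_(mask)

train(200)                                   # (1) dense training to convergence
with torch.no_grad():                        # (2) prune roughly half by magnitude, retrain sparse
    threshold = model.weight.abs().flatten().median()
    mask = (model.weight.abs() >= threshold).float()
    model.weight.mul_(mask)
train(200, mask)
train(200)                                   # (3) final dense training phase
```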

  • starts with a sparse model and trains in the sparse regime by removing and adding elements during the training process. * show annotation

  • differentiate between static and dynamic sparsity during sparse training. * show annotation

  • schemes that iteratively prune and add (regrow) elements during the training phase * show annotation
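
A numpy sketch of one prune-and-regrow step in the style of such schemes (my own simplification; e.g. SET regrows randomly, RigL uses gradient information instead).

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_and_regrow(w, mask, frac=0.2):
    """Drop the smallest-magnitude active weights, then re-activate the same
    number of currently pruned positions, keeping total sparsity fixed."""
    active = np.flatnonzero(mask)
    n_swap = int(frac * active.size)
    # prune: smallest-magnitude active connections
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:n_swap]]
    mask.ravel()[drop] = False
    w.ravel()[drop] = 0.0
    # regrow: random inactive positions (they start from zero and are trained)
    inactive = np.flatnonzero(~mask)
    grow = rng.choice(inactive, size=n_swap, replace=False)
    mask.ravel()[grow] = True
    return w, mask

w = rng.standard_normal((8, 8))
mask = rng.random((8, 8)) < 0.3              # start ~30% dense
w *= mask
w, mask = prune_and_regrow(w, mask)
print(mask.mean())                           # density unchanged after the swap
```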

  • trained with a fixed sparsity structure determined before training starts. * show annotation

  • efficient training methods would take advantage of both ephemeral and model sparsity during training * show annotation

  • use-case for sparsification is to enable ensemble models with a limited parameter and compute budget * show annotation

  • Structured sparsity constrains sparsity patterns in the weights such that they can be described with low-overhead representations such as strides or blocks. * show annotation

directly remove the pruned parameter / node from the model

  • unstructured weight sparsity requires storing the offsets of non-zero elements and handling the structure explicitly during processing. * show annotation

using a mask to get the non-zero elements (i.e. zeros still in the matrix)
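
A numpy sketch of the difference (my own illustration): unstructured pruning leaves zeros scattered in a full-size matrix, while structured pruning (here: dropping whole output neurons by row norm) yields a smaller dense matrix with no index bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))              # 6 output neurons, 8 inputs

# Unstructured: zeros stay inside the full matrix; exploiting them needs a
# mask or explicit offsets of the non-zero entries.
unstructured = W * (np.abs(W) > 1.0)

# Structured: remove whole rows (output neurons) with the smallest L2 norm;
# what remains is just a smaller dense matrix.
keep = np.sort(np.argsort(np.linalg.norm(W, axis=1))[3:])
structured = W[keep]
print(unstructured.shape, structured.shape)  # (6, 8) vs (3, 8)
```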

  • fixing a weight budget and keeping the top-k weights globally or per layer, one could learn sparsification thresholds per layer * show annotation
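
A numpy sketch of the fixed-budget, global top-k variant (my own code): a single magnitude cutoff over all layers, so per-layer sparsity follows from the budget rather than being set by hand.

```python
import numpy as np

def global_magnitude_masks(weights, keep_fraction=0.05):
    """Keep the top-k weights by absolute value across all layers jointly."""
    mags = np.concatenate([np.abs(w).ravel() for w in weights])
    k = max(1, int(keep_fraction * mags.size))
    threshold = np.partition(mags, -k)[-k]       # one global cutoff
    return [np.abs(w) >= threshold for w in weights]

rng = np.random.default_rng(0)
layers = [rng.standard_normal((300, 100)), rng.standard_normal((100, 10))]
masks = global_magnitude_masks(layers)
print([float(m.mean()) for m in masks])          # per-layer density falls out of the global budget
```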

  • Han et al. [2016b] popularized magnitude pruning for modern deep neural networks as part of neural network compression for inference. * show annotation

  • In unstructured pruning, the popular paper on model compression by Han et al. [2016b] combines magnitude-based sparsification, quantization, weight sharing, and Huffman coding into a compression scheme able to reduce AlexNet and VGG-16 on the ImageNet dataset by 35× and 49×, respectively, without loss of accuracy. * show annotation
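
A sketch of the weight-sharing idea from that pipeline (my simplification, not the exact Deep Compression procedure, which also Huffman-codes the indices): cluster surviving weights into a small codebook and store integer codes instead of full-precision values.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
w *= np.abs(w) > 0.5                          # magnitude pruning first

nz = w[w != 0].reshape(-1, 1)
codebook, codes = kmeans2(nz, 16, minit='points')   # 16 shared values -> 4-bit codes
reconstructed = codebook[codes].ravel()       # quantized weights used at inference
print(codebook.shape, codes.dtype)
```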

  • training a sparse model is more prone to converge to suboptimal local minima than a dense network * show annotation

  • Sparse networks do not always execute faster than dense networks using current machine learning frameworks on today’s hardware. * show annotation

  • scientific computing kernels such as the sparse BLAS or cuSPARSE are only optimized for scientific computing workloads and supported formats aimed at high sparsities such as compressed sparse row * show annotation

whereas sparsity levels in DL are normally not that high

  • completely unstructured storage where the offset for each single element needs to be encoded * show annotation

  • structured storage formats that only store offsets of blocks or other elements arranged with a fixed structure. * show annotation
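
A scipy sketch of the storage-format point (my own numbers): at the moderate sparsity levels common in deep learning, the per-non-zero index overhead of a completely unstructured format like CSR eats much of the memory savings.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))
W *= rng.random((1024, 1024)) < 0.3           # ~70% sparsity, moderate for DL

sp = csr_matrix(W)
dense_bytes = W.nbytes
csr_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(dense_bytes, csr_bytes)                 # CSR is smaller, but each non-zero costs
                                              # value + column index (~12 B), so the saving
                                              # is far below what 70% sparsity suggests
```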

  • why can networks be pruned and what is the best pruning methodology remain as open questions * show annotation

  • pruning is most efficient for architectures that are overparameterized * show annotation

  • should always consider the degree of over-parameterization or what we call the “parameter efficiency” * show annotation

The lottery ticket hypothesis - Finding sparse, trainable neural networks

  • identified by Blalock et al. [2020] who propose a standard methodology together with a set of benchmarks to solve this issue * show annotation

https://github.com/JJGO/shrinkbench

  • toy examples, the MNIST dataset with the LeNet-300-100 and LeNet-5 networks can act as a good calibration * show annotation

  • state of the art is above 98% accuracy with less than 1% of the original parameters * show annotation

  • global magnitude pruning is a good baseline method for a wide range of scenarios, see e.g., Singh and Alistarh [2020] for results. * show annotation

  • breakthrough results in the area of efficient convolutional neural networks can be seen as manually defined sparsifiers, such as bottleneck layers or depthwise separable convolutions * show annotation

  • Newer works on transformers suggest the more automated way of “train big and then prune” * show annotation

  • unstructured iterative magnitude pruning of Zhu and Gupta [2017] on CNNs for image classification results in a large degradation in accuracy for a small number of classes in tasks such as ImageNet, compared to the model’s overall decrease. * show annotation

  • quantization results in a much smaller impact to different classes * show annotation

  • pruned models are significantly more brittle under distribution shifts, such as corrupted images * show annotation

  • Increased errors on certain classes caused by pruning can amplify existing algorithmic biases. * show annotation

  • Therefore, it is important to study the finer-grained impacts of pruning, rather than just the overall accuracy * show annotation

  • flurry of simple approaches enables reaching moderate sparsity levels (e.g., 50–90%) at the same or even increased accuracy * show annotation

  • reaching higher sparsity levels (e.g., >95%) requires more elaborate pruning techniques where we may be reaching the limit of gradient-based optimization techniques for learning. * show annotation

  • Structured pruning seems to provide a great tradeoff between accuracy and performance on today’s architectures * show annotation