Sparsity in Deep Learning - Pruning and Growth for Efficient Inference and Training in Neural Networks


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags:

Annotations


  • over-parameterized models are easier to train with stochastic gradient descent (SGD) than more compact representations * show annotation

  • Deep learning models are traditionally dense and over-parameterized, sometimes to the extent that they can memorize random patterns in data * show annotation

  • over-parameterization comes at the cost of additional memory and computation effort during model training and inference * show annotation

  • found that sparsity can improve robustness against adversarial attacks * show annotation

  • investigate sparsity during the training process to manage the costs of training * show annotation

  • which elements of a neural network are sparsified, when are they sparsified, and how can they be sparsified * show annotation

  • consider sparse training and the need to re-add connections during training to maintain a constant model complexity after sparsification * show annotation

  • nearly every basic approach has been invented at least twice * show annotation

  • Down-sizing models creates smaller dense networks to solve the same task * show annotation

  • Operator factorization decomposes operators, for example the matrix multiplication of dense layers, into smaller operators. * show annotation
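
A minimal numpy sketch of this idea (my own illustration, not from the paper): replace one dense matmul with two thin ones via a truncated SVD. The rank of 32 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))      # one dense layer's weight matrix

# Replace the 512x512 matmul (~262k parameters) with two thin matmuls,
# 512x32 and 32x512 (~33k parameters), via a truncated SVD.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :32] * S[:32]                   # 512 x 32
B = Vt[:32]                              # 32 x 512

x = rng.standard_normal(512)
y_approx = A @ (B @ x)                   # (here crude) low-rank approximation of W @ x, with fewer FLOPs
```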

  • Value quantization seeks to find a good low-precision encoding for values in the networks, such as weights, activations, or gradients. * show annotation

  • Value Compression can be used to compress model structures and values (e.g., weights) * show annotation

  • Parameter sharing can lead to model compression by exploiting redundancy in the parameter space * show annotation

  • Sparsification can lead to more efficient models that continue to operate in high-dimensional feature spaces but reduce the representational complexity using only a subset of the dimensions at a time. * show annotation

  • parameter sharing, can also reduce the computational complexity * show annotation

  • In this paper, we focus on the most complex and, in our view, most powerful of those techniques: sparsification, also known as “pruning” in some contexts. * show annotation

  • quick executive overview of the field, then we recommend studying Sections 2 and 8 while skimming Sections 3, 4, 5, and 7, especially the overview figures and tables therein. * show annotation

  • sparsification best practices, we recommend the executive overview in combination with details in Section 6 and the references therein. * show annotation

  • (1) improved generalization and robustness and (2) improved performance for inference and/or training * show annotation

  • differentiate between model (also structural) and ephemeral sparsification * show annotation

  • hardware engineering aspects, then we recommend to at least get the executive overview mentioned before and study Section 7 in detail * show annotation

  • Model sparsification changes the model but does not change the sparsity pattern across multiple inference or forward passes. * show annotation

  • If we sparsify arbitrary weights, the resulting model may be unstructured and we may need to remember indices as described before. This adds overheads for index structures and leads to less efficient execution on hardware that is optimized for dense computations * show annotation

  • Ephemeral sparsification is a second class of sparsification approaches—it is applied during the calculation of each example individually and only relevant for this example. * show annotation

  • ephemeral sparsity is dynamically updated for each example and configured with a small number of parameters during inference and training, model sparsity follows a more complex NAS-like procedure. Model sparsity is thus often trained with a schedule. We differentiate three different classes of training schedules illustrated in Fig. 7. * show annotation
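
A small numpy sketch of the distinction (my own example, not from the paper): a pruning mask on the weights stays fixed across inputs (model sparsity), while ReLU zeroes a different subset of activations for every input (ephemeral sparsity).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))

# Model (structural) sparsity: one fixed mask, reused for every example.
weight_mask = np.abs(W) > 0.8
W_pruned = W * weight_mask

for _ in range(3):
    x = rng.standard_normal(8)
    act = np.maximum(W_pruned @ x, 0.0)          # ReLU
    # Ephemeral sparsity: which activations are zero changes per example.
    print("zeroed activations:", np.flatnonzero(act == 0.0))
```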

  • 2.4.2 Sparsify during training. The sparsify-during-training * show annotation

  • since we are starting from a dense model, training does not change such that existing hyperparameter settings and learning schedules can be re-used * show annotation

  • this approach needs to hold the dense model in memory at the beginning of the operation and thus does not enable the use of smaller-capacity devices * show annotation

  • Instead of deleting pruned weights and gradients, they use binary masks to determine the presence or absence of weights and update even masked weights during backpropagation to enable better weight regrowth/selection * show annotation

Does this mean the normal way is to delete? Does "delete" mean really removing the weight, or just setting it to zero?

Vs masking?
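
A numpy sketch of my reading of the two options (not from the paper): "masking" keeps the tensor shape and just zeroes pruned entries, so they can still be updated and regrown, whereas truly deleting keeps only surviving values plus their indices.

```python
import numpy as np

W = np.array([[0.9, -0.05, 0.0, 1.2],
              [0.02, -0.7, 0.3, -0.01]])

# (a) Masking: pruned weights are set to zero but stay in the tensor,
#     so backpropagation can still touch them and connections can regrow.
mask = np.abs(W) >= 0.1
W_masked = W * mask                       # same shape, zeros where pruned

# (b) True deletion: keep only surviving values and their indices; this
#     saves memory but needs explicit index structures downstream.
rows, cols = np.nonzero(mask)
values = W[rows, cols]
print(W_masked.shape, values.shape)       # (2, 4) vs (4,)
```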

  • Iterative hard thresholding (IHT) is a technique where training schedules of dense and sparse iterations are combined * show annotation

Different from iterative magnitude pruning (lottery ticket hypothesis)

From The lottery ticket hypothesis - Finding sparse, trainable neural networks under “After Training”: “Han et al. (2017) and Jin et al. (2016) restore pruned connections to increase network capacity after small weights have been pruned and surviving weights fine-tuned.”
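
A sketch of just the hard-thresholding step (my own code; the actual IHT schedules in the cited works differ in details): keep the top-k weights by magnitude and zero the rest, and alternate this with dense training phases.

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w, zero out the rest
    (ties at the cutoff may keep slightly more than k)."""
    mags = np.abs(w).ravel()
    if k >= mags.size:
        return w.copy()
    cutoff = np.partition(mags, -k)[-k]      # k-th largest magnitude
    return np.where(np.abs(w) >= cutoff, w, 0.0)

w = np.random.default_rng(0).standard_normal((4, 4))
print(hard_threshold(w, 4))                  # only 4 surviving weights
```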

  • Han et al. [2017] use a similar scheme where they run three steps during training: (1) (traditional) dense training to convergence, (2) magnitude-pruning followed by retraining, and (3) dense training * show annotation

DSD - Dense Sparse Dense

From The lottery ticket hypothesis - Finding sparse, trainable neural networks under “After Training”: “Han et al. (2017) and Jin et al. (2016) restore pruned connections to increase network capacity after small weights have been pruned and surviving weights fine-tuned.”
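
A toy PyTorch sketch of that three-step dense-sparse-dense schedule, on made-up data and a single linear layer; the ~50% pruning ratio and step counts are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randn(256, 1)    # dummy data
model = nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train(steps, mask=None):
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), y).backward()
        opt.step()
        if mask is not None:                 # keep pruned weights at zero
            with torch.no_grad():
                model.weight.mul_(mask)

train(200)                                   # (1) dense training to convergence
with torch.no_grad():                        # (2) prune roughly half by magnitude, retrain sparse
    threshold = model.weight.abs().flatten().median()
    mask = (model.weight.abs() >= threshold).float()
    model.weight.mul_(mask)
train(200, mask)
train(200)                                   # (3) final dense training phase
```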

  • starts with a sparse model and trains in the sparse regime by removing and adding elements during the training process. * show annotation

  • differentiate between static and dynamic sparsity during sparse training. * show annotation

  • schemes that iteratively prune and add (regrow) elements during the training phase * show annotation
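
A numpy sketch of one prune-and-regrow step in the style of such schemes (my own simplification; e.g. SET regrows randomly, RigL uses gradient information instead).

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_and_regrow(w, mask, frac=0.2):
    """Drop the smallest-magnitude active weights, then re-activate the same
    number of currently pruned positions, keeping total sparsity fixed."""
    active = np.flatnonzero(mask)
    n_swap = int(frac * active.size)
    # prune: smallest-magnitude active connections
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:n_swap]]
    mask.ravel()[drop] = False
    w.ravel()[drop] = 0.0
    # regrow: random inactive positions (they start from zero and are trained)
    inactive = np.flatnonzero(~mask)
    grow = rng.choice(inactive, size=n_swap, replace=False)
    mask.ravel()[grow] = True
    return w, mask

w = rng.standard_normal((8, 8))
mask = rng.random((8, 8)) < 0.3              # start ~30% dense
w *= mask
w, mask = prune_and_regrow(w, mask)
print(mask.mean())                           # density unchanged after the swap
```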

  • trained with a fixed sparsity structure determined before training starts. * show annotation

  • efficient training methods would take advantage of both ephemeral and model sparsity during training * show annotation

  • use-case for sparsification is to enable ensemble models with a limited parameter and compute budget * show annotation

  • Structured sparsity constrains sparsity patterns in the weights such that they can be described with low-overhead representations such as strides or blocks. * show annotation

directly remove the pruned parameter / node from the model

  • unstructured weight sparsity requires storing the offsets of non-zero elements and handling the structure explicitly during processing. * show annotation

using a mask to get the non-zero elements (i.e. zeros still in the matrix)
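
A numpy sketch of the difference (my own illustration): unstructured pruning leaves zeros scattered in a full-size matrix, while structured pruning (here: dropping whole output neurons by row norm) yields a smaller dense matrix with no index bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))              # 6 output neurons, 8 inputs

# Unstructured: zeros stay inside the full matrix; exploiting them needs a
# mask or explicit offsets of the non-zero entries.
unstructured = W * (np.abs(W) > 1.0)

# Structured: remove whole rows (output neurons) with the smallest L2 norm;
# what remains is just a smaller dense matrix.
keep = np.sort(np.argsort(np.linalg.norm(W, axis=1))[3:])
structured = W[keep]
print(unstructured.shape, structured.shape)  # (6, 8) vs (3, 8)
```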

  • fixing a weight budget and keeping the top-k weights globally or per layer, one could learn sparsification thresholds per layer * show annotation
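
A numpy sketch of the fixed-budget, global top-k variant (my own code): a single magnitude cutoff over all layers, so per-layer sparsity follows from the budget rather than being set by hand.

```python
import numpy as np

def global_magnitude_masks(weights, keep_fraction=0.05):
    """Keep the top-k weights by absolute value across all layers jointly."""
    mags = np.concatenate([np.abs(w).ravel() for w in weights])
    k = max(1, int(keep_fraction * mags.size))
    threshold = np.partition(mags, -k)[-k]       # one global cutoff
    return [np.abs(w) >= threshold for w in weights]

rng = np.random.default_rng(0)
layers = [rng.standard_normal((300, 100)), rng.standard_normal((100, 10))]
masks = global_magnitude_masks(layers)
print([float(m.mean()) for m in masks])          # per-layer density falls out of the global budget
```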

  • Han et al. [2016b] popularized magnitude pruning for modern deep neural networks as part of neural network compression for inference. * show annotation

  • In unstructured pruning, the popular paper on model compression by Han et al. [2016b] combines magnitude-based sparsification, quantization, weight sharing, and Huffman coding into a compression scheme able to reduce AlexNet and VGG-16 on the ImageNet dataset by 35× and 49×, respectively, without loss of accuracy. * show annotation
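
A sketch of the weight-sharing idea from that pipeline (my simplification, not the exact Deep Compression procedure, which also Huffman-codes the indices): cluster surviving weights into a small codebook and store integer codes instead of full-precision values.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
w *= np.abs(w) > 0.5                          # magnitude pruning first

nz = w[w != 0].reshape(-1, 1)
codebook, codes = kmeans2(nz, 16, minit='points')   # 16 shared values -> 4-bit codes
reconstructed = codebook[codes].ravel()       # quantized weights used at inference
print(codebook.shape, codes.dtype)
```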

  • training a sparse model is more prone to converge to suboptimal local minima than a dense network * show annotation

  • Sparse networks do not always execute faster than dense networks using current machine learning frameworks on today’s hardware. * show annotation

  • scientific computing kernels such as the sparse BLAS or cuSPARSE are only optimized for scientific computing workloads and supported formats aimed at high sparsities such as compressed sparse row * show annotation

whereas sparsity levels in DL are normally not that high

  • completely unstructured storage where the offset for each single element needs to be encoded * show annotation

  • structured storage formats that only store offsets of blocks or other elements arranged with a fixed structure. * show annotation
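
A scipy sketch of the storage-format point (my own numbers): at the moderate sparsity levels common in deep learning, the per-non-zero index overhead of a completely unstructured format like CSR eats much of the memory savings.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))
W *= rng.random((1024, 1024)) < 0.3           # ~70% sparsity, moderate for DL

sp = csr_matrix(W)
dense_bytes = W.nbytes
csr_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(dense_bytes, csr_bytes)                 # CSR is smaller, but each non-zero costs
                                              # value + column index (~12 B), so the saving
                                              # is far below what 70% sparsity suggests
```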

  • why can networks be pruned and what is the best pruning methodology remain as open questions * show annotation

  • pruning is most efficient for architectures that are overparameterized * show annotation

  • should always consider the degree of over-parameterization or what we call the “parameter efficiency” * show annotation

The lottery ticket hypothesis - Finding sparse, trainable neural networks

  • identified by Blalock et al. [2020] who propose a standard methodology together with a set of benchmarks to solve this issue * show annotation

https://github.com/JJGO/shrinkbench

  • toy examples, the MNIST dataset with the LeNet-300-100 and LeNet-5 networks can act as a good calibration * show annotation

  • state of the art is above 98% accuracy with less than 1% of the original parameters * show annotation

  • global magnitude pruning is a good baseline method for a wide range of scenarios, see e.g., Singh and Alistarh [2020] for results. * show annotation

  • breakthrough results in the area of efficient convolutional neural networks can be seen as manually defined sparsifiers, such as bottleneck layers or depthwise separable convolutions * show annotation

  • Newer works on transformers suggest the more automated way of “train big and then prune” * show annotation

  • unstructured iterative magnitude pruning of Zhu and Gupta [2017] on CNNs for image classification results in a large degradation in accuracy for a small number of classes in tasks such as ImageNet, compared to the model’s overall decrease. * show annotation

  • quantization results in a much smaller impact to different classes * show annotation

  • pruned models are significantly more brittle under distribution shifts, such as corrupted images * show annotation

  • Increased errors on certain classes caused by pruning can amplify existing algorithmic biases. * show annotation

  • Therefore, it is important to study the finer-grained impacts of pruning, rather than just the overall accuracy * show annotation

  • flurry of simple approaches enables reaching moderate sparsity levels (e.g., 50–90%) at the same or even increased accuracy * show annotation

  • reaching higher sparsity levels (e.g., >95%) requires more elaborate pruning techniques where we may be reaching the limit of gradient-based optimization techniques for learning. * show annotation

  • Structured pruning seems to provide a great tradeoff between accuracy and performance on today’s architectures * show annotation