EfficientDL - 4. Pruning and Sparsity (Part II)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Pruning

Pruning Ratio (p.261)

How should we find per-layer pruning ratios? (p.261)

  • sensitivity of each layer!
    • some layers: prune a little and accuracy drops a lot
    • some layers (e.g., fully connected / MLP layers): prune a lot and accuracy stays about the same - choose these!
  • need different pruning ratios for each layer, since different layers have different sensitivity (p.265)
    • Some layers are more sensitive (e.g., the first layer) (p.265)
Sensitivity Analysis

to determine the per-layer pruning ratio (p.265); a minimal scan sketch follows the list below

  • in this example the green line (layer 1) shows the smallest dip in accuracy, even at a high pruning ratio of 90%
    • so it seems like we can prune this layer heavily!
    • one catch: we are assuming the layers are independent of each other and do not interfere with each other
    • this independence assumption is a tradeoff between GPU hours and how many (layer, ratio) points we can afford to evaluate for accuracy
    • does this mean that we should prune layer 1?
      • not necessarily: the plot does not show the actual size of each layer - the green curve could be a small layer, so even pruning 80% of it may remove very few parameters
      • we also need to consider the actual size of each layer, which is what AMC / NetAdapt do
    • for newer LLMs, layer sizes tend to be more uniform than in CNNs
  • want to make pruning a one-button solution that runs automatically
  • specify a target model size, then get back the smaller model
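
A minimal sketch of such a sensitivity scan (assumptions: a pretrained PyTorch `model`, a `validate(model)` function returning accuracy, and fine-grained magnitude pruning; these names are placeholders, not from the lecture):

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def prune_layer_(layer: nn.Module, ratio: float):
    """Zero out the smallest-magnitude weights of one layer (fine-grained pruning)."""
    w = layer.weight.data
    k = int(w.numel() * ratio)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    layer.weight.data = w * (w.abs() > threshold)

def sensitivity_scan(model, validate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Prune one layer at a time (others untouched) and record validation accuracy."""
    results = {}
    prunable = [(n, m) for n, m in model.named_modules()
                if isinstance(m, (nn.Conv2d, nn.Linear))]
    for name, _ in prunable:
        accs = []
        for r in ratios:
            candidate = copy.deepcopy(model)              # layers assumed independent
            prune_layer_(dict(candidate.named_modules())[name], r)
            accs.append(validate(candidate))
        results[name] = accs                              # a flat curve = a less sensitive layer
    return results
```
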
AMC: AutoML for Model Compression (p.281)

(p.281) Pruning as a reinforcement learning problem.

For ResNet-50 (p.285):
  • the RL agent automatically learns that 3x3 convolutions have more redundancy and can be pruned more
  • the RL agent automatically learns that 1x1 convolutions have less redundancy and can be pruned less

  • latest 2023: 0.8 ms latency for ResNet!
  • why does 0.75 MobileNet (channels shrunk uniformly across layers) give roughly a 70% speedup?
    • FLOPs of a conv layer are a product of 6 terms: output height, output width, kernel height, kernel width, input channels, output channels; since input channels and output channels are both scaled to 0.75x, the relationship is quadratic (0.75^2 ≈ 0.56x the compute)
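
A quick sanity check of that quadratic relationship; the layer shape below is made up for illustration:

```python
def conv_macs(h, w, kh, kw, c_in, c_out):
    # MACs of a standard convolution: output spatial size x kernel size x channels
    return h * w * kh * kw * c_in * c_out

base = conv_macs(56, 56, 3, 3, 128, 128)
shrunk = conv_macs(56, 56, 3, 3, int(128 * 0.75), int(128 * 0.75))
print(shrunk / base)   # ~0.5625 = 0.75^2, so 0.75x width gives ~0.56x compute
```
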
NetAdapt (p.287)

(p.287) A rule-based iterative/progressive method

  • find a per-layer pruning ratio to meet a global resource constraint (e.g., latency, energy, …) (p.287)
  • process is done iteratively (p.287)
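
A rough sketch of this iterative loop under simplifying assumptions; `evaluate`, `measure`, `prune_one_layer`, and `short_finetune` are hypothetical user-supplied helpers, not NetAdapt's actual API:

```python
import copy

def netadapt(model, num_layers, evaluate, measure, prune_one_layer, short_finetune,
             target_resource, step=0.05):
    """One possible NetAdapt-style loop.

    evaluate(model) -> accuracy; measure(model) -> resource (e.g. measured latency);
    prune_one_layer(model, i, budget) -> model with layer i pruned to meet the budget;
    short_finetune(model) -> model after a brief fine-tune. All helpers are user-supplied.
    """
    while measure(model) > target_resource:
        budget = measure(model) * (1 - step)            # per-iteration resource reduction
        candidates = []
        for i in range(num_layers):                     # try pruning each layer alone
            cand = prune_one_layer(copy.deepcopy(model), i, budget)
            cand = short_finetune(cand)
            candidates.append((evaluate(cand), cand))
        _, model = max(candidates, key=lambda t: t[0])  # keep the most accurate candidate
    return model                                        # followed by a longer fine-tune in the full method
```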

Fine-tuning / Training (p.296)

(p.296) How should we improve performance of sparse models?

(p.297)

  • Fine-tuning the pruned neural networks will help recover the accuracy and push the pruning ratio higher.
  • Learning rate for fine-tuning is usually 1/100 or 1/10 of the original learning rate.
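
A minimal sketch of fine-tuning a pruned PyTorch model while keeping the zeros fixed; `model`, `train_loader`, and `criterion` are assumed to exist, and the learning rate assumes an original training LR of about 1e-2:

```python
import torch

# masks capture the sparsity pattern fixed at pruning time (1 = keep, 0 = pruned)
masks = {name: (p.detach() != 0).float()
         for name, p in model.named_parameters() if p.dim() > 1}

# rule of thumb from the slide: fine-tune at ~1/10 to 1/100 of the original LR
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    with torch.no_grad():                         # re-apply the masks after each step so
        for name, p in model.named_parameters():  # pruned weights stay exactly zero
            if name in masks:
                p.mul_(masks[name])
```
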
Iterative Pruning (p.305)
  • Prune + Finetune = 1 iteration

Iterative pruning gradually increases the target sparsity in each iteration. (p.305) - e.g. 30% → 50% → 70%

  • better than single-step aggressive pruning
    • boosts the pruning ratio from 5x to 9x on AlexNet compared to single-step aggressive pruning (p.305)
  • requires high engineering effort and is difficult to automate
    • since sensitivity analysis, pruning, and fine-tuning need to be redone at each step
  • for research on a single model it is easy to do as a one-off
  • but if there are bypass connections (e.g., ResNet)
    • there are additional dependencies: the two tensors in an elementwise add must have the same shape, otherwise the add is impossible
    • so we need to trace these dependencies and prune the linked layers consistently (a schedule sketch follows this list)
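
A sketch of the 30% → 50% → 70% schedule with `torch.nn.utils.prune`; `model` and the `finetune` helper are assumed, and note that when pruning is applied iteratively, `amount` refers to the still-unpruned weights, hence the relative amounts below:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prunable_modules(model):
    return [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]

# relative per-step amounts that reach 30% -> 50% -> 70% global sparsity
relative_amounts = [0.30, 0.2 / 0.7, 0.4]

for amount in relative_amounts:
    for module in prunable_modules(model):
        prune.l1_unstructured(module, name="weight", amount=amount)
    model = finetune(model)          # recover accuracy before pruning further

for module in prunable_modules(model):
    prune.remove(module, "weight")   # bake the final masks into the weights
```
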
Regularization (p.306)
  • add a regularization term to the loss (p.306) so as to:
    • penalize non-zero parameters
    • encourage smaller parameters
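
For example, an L1 penalty added to the task loss (the `l1_lambda` value and the surrounding argument names are placeholders):

```python
def loss_with_l1(model, criterion, outputs, targets, l1_lambda=1e-5):
    # task loss plus an L1 penalty that pushes weights toward zero (i.e., prunable)
    l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
    return criterion(outputs, targets) + l1_lambda * l1_penalty
```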

The Lottery Ticket Hypothesis (p.308)

  • currently, we have to train a model then prune it
  • can we instead start from a sparse (pruned) architecture, train it directly, and beat the train-then-prune model? (p.308)

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that when trained in isolation it can match the test accuracy of the original network after training for at most the same number of iterations. - The Lottery Ticket Hypothesis

see Iterative magnitude pruning

  • train to convergence, then prune based on magnitude to get the sparsity pattern
  • inherit the sparsity pattern, reset the surviving weights to their original initialization, and train to convergence again
  • for small datasets like CIFAR and MNIST this still works, but for large datasets like ImageNet it remains challenging! ^78b8e6
    • training for 3-4 epochs (plus pruning) is enough to determine the sparsity pattern
    • but you still need to go through the training afterwards
    • active research area!
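
A condensed sketch of one variant of this procedure; `model` and `train_to_convergence(model, masks)` are assumed to exist, and pruning 20% of the remaining weights per round (over 5 rounds) is an assumption, not a prescription from the lecture:

```python
import copy
import torch

init_state = copy.deepcopy(model.state_dict())   # theta_0: the original random initialization
masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

for _round in range(5):
    train_to_convergence(model, masks)           # 1. train with the current masks applied
    with torch.no_grad():
        for n, p in model.named_parameters():    # 2. prune 20% of the remaining weights
            if n in masks:
                remaining = p[masks[n].bool()].abs()
                k = max(1, int(0.2 * remaining.numel()))
                masks[n] *= (p.abs() > remaining.kthvalue(k).values).float()
        model.load_state_dict(init_state)        # 3. reset surviving weights to theta_0
        for n, p in model.named_parameters():    # 4. re-apply the sparsity pattern
            if n in masks:
                p.mul_(masks[n])
```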

System Support for Sparsity (p.311)

  • use coarse-grained pruning at the filter level - remove entire rows from the weight matrix
    • can still use dense matrix multiplication libraries for acceleration
    • but the limitation is a lower achievable pruning ratio; what kind of hardware support is available for more fine-grained pruning?
EIE: Weight Sparsity + Activation Sparsity for GEMM (p.313)

  • all 3 can be stacked together: the speedup multiplies across all 3
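
A toy software illustration of the idea EIE exploits in hardware, skipping zero activations entirely and touching only stored non-zero weights (plain NumPy with CSC-like storage; not EIE's actual dataflow or format):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * (rng.random((8, 8)) < 0.1)   # ~90% weight sparsity
x = rng.standard_normal(8) * (rng.random(8) < 0.3)             # ~70% activation sparsity

# CSC-like storage: per column, keep only (row index, value) of the non-zero weights
columns = [[(i, W[i, j]) for i in range(W.shape[0]) if W[i, j] != 0]
           for j in range(W.shape[1])]

y = np.zeros(W.shape[0])
for j, a in enumerate(x):
    if a == 0:                       # activation sparsity: skip zero activations entirely
        continue
    for i, w in columns[j]:          # weight sparsity: only stored non-zeros are touched
        y[i] += w * a                # multiply-accumulate, as in EIE's processing elements

assert np.allclose(y, W @ x)
```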

Retrospective on EIE: a 2023 retrospective paper on the original 2016 paper

  • Pros
    • exploits both weight sparsity and activation sparsity (p.336)
    • supports fine-grained sparsity, and allows pruning to achieve a higher pruning ratio (p.336)
    • Aggressive weight quantization (4bit) to save memory footprint (p.336)
      • EIE decodes the weight to 16bit and uses 16bit arithmetic (p.336)
      • W4A16 approach is reborn in LLM: GPTQ, AWQ, llama.cpp, MLC LLM (p.336)
        • 4bit weight, 16bit activation
  • Cons
    • is not as easily applied to arrays of vector processors; the fix is structured sparsity (N:M sparsity), as in newer NVIDIA GPUs (p.336)
    • only supports FC layers - though FC layers are reborn in LLMs (p.336)
    • fits everything in SRAM - practical for TinyML, not for LLMs (p.336)
      • the TPU was designed with 28 MB of SRAM
      • because the model weights were around 28M parameters
      • commercial accelerators normally use SRAM to fit all the weights

(p.337)
  • Generative AI: spatial sparsity (SIGE, NeurIPS’22)
  • Transformer: token sparsity, progressive quantization (SpAtten, HPCA’21)
  • Video: temporal sparsity (TSM, ICCV’19)
  • Point cloud: spatial sparsity (TorchSparse, MLSys’22 & PointAcc, Micro’22)

(p.337) future AI models will be sparse at various granularity and structures

  • MoE models (e.g., GPT-4): a giant model where each forward pass only uses a portion of the model, not all of it

(p.337) Codesigned with specialized accelerators, sparse models will become more efficient and accessible.

NVIDIA Tensor Core: M:N Weight Sparsity (p.338)

2:4 weight sparsity / M:N sparsity

  • at most 2 non-zero values in every group of 4 consecutive elements
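
A small sketch of enforcing 2:4 sparsity by keeping the two largest-magnitude values in every aligned group of four (magnitude-based selection is an assumption here, though it is a common choice):

```python
import torch

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Return w with at most 2 non-zeros in every aligned group of 4 along the last dim."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group -> zero them out
    _, drop = groups.abs().topk(2, dim=1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop, 0.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(4, 8)
print(prune_2_to_4(w))   # every aligned group of 4 now has at least 2 zeros
```
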
TorchSparse & PointAcc: Activation Sparsity for Sparse Convolution (p.344)
TorchSparse: Sparse Convolution Library (p.345)
  • after a regular convolution, a feature map that was sparse at the input becomes dense at the output!
  • how can we keep the output sparse, with the same sparsity pattern as the input? (see the toy example after this list)
  • still pad some work with zeros to allow for parallelism: if everything is irregular, it is difficult to parallelize across the GPU
  • Separate computation (baseline): many kernel calls, low device utilization (p.362)
  • Dense convolution: best regularity but large computation overhead (p.363)
  • Computation with grouping: balances overhead and regularity (p.364)
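
A tiny demonstration of the dilation problem and of keeping the output sparsity pattern equal to the input's; the masking here is only a conceptual stand-in for what a real sparse convolution library like TorchSparse does:

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 5, 5)
x[0, 0, 2, 2] = 1.0                      # a single active site in the input feature map
w = torch.ones(1, 1, 3, 3)

dense_out = F.conv2d(x, w, padding=1)
print(int((dense_out != 0).sum()))       # 9 -> a regular conv dilates the sparsity

masked_out = dense_out * (x != 0)        # compute outputs only at originally active sites
print(int((masked_out != 0).sum()))      # 1 -> same sparsity pattern as the input
```
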
PointAcc: Hardware Accelerator for Sparse Convolution (p.370)
  • dedicated hardware for point cloud acceleration

Current Challenges

  • LLM pruning still struggles to deliver real speedups, partly because fine-tuning at LLM scale is challenging
  • iterative magnitude pruning is still challenging for large datasets like ImageNet ^78b8e6