EfficientDL - 4. Pruning and Sparsity (Part II)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Pruning

Pruning Ratio (p.261)

How should we find per-layer pruning ratios? (p.261)

  • sensitivity of each layer!
    • some layers: prune a little and accuracy drops a lot
    • some layers (e.g., fully connected / MLP layers): prune a lot and accuracy stays about the same - choose these!
  • need different pruning ratios for each layer, since different layers have different sensitivity (p.265)
    • Some layers are more sensitive (e.g., the first layer) (p.265)
Sensitivity Analysis

to determine the per-layer pruning ratio (p.265); a minimal scan sketch follows the list below

  • in this example the green line (layer 1) shows the smallest dip in accuracy, even at a high pruning ratio of 90%
    • so it seems like we can prune this layer heavily!
    • one catch: we are assuming the layers are independent of each other and do not interfere with each other
    • this independence assumption is a tradeoff between GPU hours and how many (layer, ratio) points we can afford to evaluate for accuracy
    • does this mean that we should prune layer 1?
      • not necessarily: the plot does not show the actual size of each layer - the green curve could be a small layer, so even pruning 80% of it may remove very few parameters
      • we also need to consider the actual size of each layer, which is what AMC / NetAdapt do
    • for newer LLMs, layer sizes tend to be more uniform than in CNNs
  • want to make pruning a one-button solution that runs automatically
  • specify a target model size, then get back the smaller model
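
A minimal sketch of such a sensitivity scan (assumptions: a pretrained PyTorch `model`, a `validate(model)` function returning accuracy, and fine-grained magnitude pruning; these names are placeholders, not from the lecture):

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def prune_layer_(layer: nn.Module, ratio: float):
    """Zero out the smallest-magnitude weights of one layer (fine-grained pruning)."""
    w = layer.weight.data
    k = int(w.numel() * ratio)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    layer.weight.data = w * (w.abs() > threshold)

def sensitivity_scan(model, validate, ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Prune one layer at a time (others untouched) and record validation accuracy."""
    results = {}
    prunable = [(n, m) for n, m in model.named_modules()
                if isinstance(m, (nn.Conv2d, nn.Linear))]
    for name, _ in prunable:
        accs = []
        for r in ratios:
            candidate = copy.deepcopy(model)              # layers assumed independent
            prune_layer_(dict(candidate.named_modules())[name], r)
            accs.append(validate(candidate))
        results[name] = accs                              # a flat curve = a less sensitive layer
    return results
```
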
AMC: AutoML for Model Compression (p.281)

(p.281) Pruning as a reinforcement learning problem.

For ResNet-50 (p.285):
  • the RL agent automatically learns that 3x3 convolutions have more redundancy and can be pruned more
  • the RL agent automatically learns that 1x1 convolutions have less redundancy and can be pruned less

  • latest 2023: 0.8 ms latency for ResNet!
  • why does 0.75 MobileNet (channels shrunk uniformly across layers) give roughly a 70% speedup?
    • FLOPs of a conv layer are a product of 6 terms: output height, output width, kernel height, kernel width, input channels, output channels; since input channels and output channels are both scaled to 0.75x, the relationship is quadratic (0.75^2 ≈ 0.56x the compute)
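
A quick sanity check of that quadratic relationship; the layer shape below is made up for illustration:

```python
def conv_macs(h, w, kh, kw, c_in, c_out):
    # MACs of a standard convolution: output spatial size x kernel size x channels
    return h * w * kh * kw * c_in * c_out

base = conv_macs(56, 56, 3, 3, 128, 128)
shrunk = conv_macs(56, 56, 3, 3, int(128 * 0.75), int(128 * 0.75))
print(shrunk / base)   # ~0.5625 = 0.75^2, so 0.75x width gives ~0.56x compute
```
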
NetAdapt (p.287)

(p.287) A rule-based iterative/progressive method

  • find a per-layer pruning ratio to meet a global resource constraint (e.g., latency, energy, …) (p.287)
  • process is done iteratively (p.287)
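
A rough sketch of this iterative loop under simplifying assumptions; `evaluate`, `measure`, `prune_one_layer`, and `short_finetune` are hypothetical user-supplied helpers, not NetAdapt's actual API:

```python
import copy

def netadapt(model, num_layers, evaluate, measure, prune_one_layer, short_finetune,
             target_resource, step=0.05):
    """One possible NetAdapt-style loop.

    evaluate(model) -> accuracy; measure(model) -> resource (e.g. measured latency);
    prune_one_layer(model, i, budget) -> model with layer i pruned to meet the budget;
    short_finetune(model) -> model after a brief fine-tune. All helpers are user-supplied.
    """
    while measure(model) > target_resource:
        budget = measure(model) * (1 - step)            # per-iteration resource reduction
        candidates = []
        for i in range(num_layers):                     # try pruning each layer alone
            cand = prune_one_layer(copy.deepcopy(model), i, budget)
            cand = short_finetune(cand)
            candidates.append((evaluate(cand), cand))
        _, model = max(candidates, key=lambda t: t[0])  # keep the most accurate candidate
    return model                                        # followed by a longer fine-tune in the full method
```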

Fine-tuning / Training (p.296)

(p.296) How should we improve performance of sparse models?

(p.297)

  • Fine-tuning the pruned neural networks will help recover the accuracy and push the pruning ratio higher.
  • Learning rate for fine-tuning is usually 1/100 or 1/10 of the original learning rate.
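
A minimal sketch of fine-tuning a pruned PyTorch model while keeping the zeros fixed; `model`, `train_loader`, and `criterion` are assumed to exist, and the learning rate assumes an original training LR of about 1e-2:

```python
import torch

# masks capture the sparsity pattern fixed at pruning time (1 = keep, 0 = pruned)
masks = {name: (p.detach() != 0).float()
         for name, p in model.named_parameters() if p.dim() > 1}

# rule of thumb from the slide: fine-tune at ~1/10 to 1/100 of the original LR
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    with torch.no_grad():                         # re-apply the masks after each step so
        for name, p in model.named_parameters():  # pruned weights stay exactly zero
            if name in masks:
                p.mul_(masks[name])
```
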
Iterative Pruning (p.305)
  • Prune + Finetune = 1 iteration

Iterative pruning gradually increases the target sparsity in each iteration. (p.305) - e.g. 30% → 50% → 70%

  • better than single-step aggressive pruning
    • boosts the pruning ratio from 5x to 9x on AlexNet compared to single-step aggressive pruning (p.305)
  • requires high engineering effort and is difficult to automate
    • since sensitivity analysis, pruning, and fine-tuning need to be redone at each step
  • for research on a single model it is easy to do as a one-off
  • but if there are bypass connections (e.g., ResNet)
    • there are additional dependencies: the two tensors in an elementwise add must have the same shape, otherwise the add is impossible
    • so we need to trace these dependencies and prune the linked layers consistently (a schedule sketch follows this list)
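
A sketch of the 30% → 50% → 70% schedule with `torch.nn.utils.prune`; `model` and the `finetune` helper are assumed, and note that when pruning is applied iteratively, `amount` refers to the still-unpruned weights, hence the relative amounts below:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prunable_modules(model):
    return [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]

# relative per-step amounts that reach 30% -> 50% -> 70% global sparsity
relative_amounts = [0.30, 0.2 / 0.7, 0.4]

for amount in relative_amounts:
    for module in prunable_modules(model):
        prune.l1_unstructured(module, name="weight", amount=amount)
    model = finetune(model)          # recover accuracy before pruning further

for module in prunable_modules(model):
    prune.remove(module, "weight")   # bake the final masks into the weights
```
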
Regularization (p.306)
  • add a regularization term to the loss (p.306) so as to:
    • penalize non-zero parameters
    • encourage smaller parameters
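
For example, an L1 penalty added to the task loss (the `l1_lambda` value and the surrounding argument names are placeholders):

```python
def loss_with_l1(model, criterion, outputs, targets, l1_lambda=1e-5):
    # task loss plus an L1 penalty that pushes weights toward zero (i.e., prunable)
    l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
    return criterion(outputs, targets) + l1_lambda * l1_penalty
```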

The Lottery Ticket Hypothesis (p.308)

  • currently, we have to train a model then prune it
  • can we instead start from a sparse (pruned) architecture, train it directly, and beat the train-then-prune model? (p.308)

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that when trained in isolation it can match the test accuracy of the original network after training for at most the same number of iterations. - The Lottery Ticket Hypothesis

see Iterative magnitude pruning

  • train to convergence, then prune based on magnitude to get the sparsity pattern
  • inherit the sparsity pattern, reset the surviving weights to their original initialization, and train to convergence again
  • for small datasets like CIFAR and MNIST this still works, but for large datasets like ImageNet it remains challenging! ^78b8e6
    • training for 3-4 epochs (plus pruning) is enough to determine the sparsity pattern
    • but you still need to go through the training afterwards
    • active research area!
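
A condensed sketch of one variant of this procedure; `model` and `train_to_convergence(model, masks)` are assumed to exist, and pruning 20% of the remaining weights per round (over 5 rounds) is an assumption, not a prescription from the lecture:

```python
import copy
import torch

init_state = copy.deepcopy(model.state_dict())   # theta_0: the original random initialization
masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

for _round in range(5):
    train_to_convergence(model, masks)           # 1. train with the current masks applied
    with torch.no_grad():
        for n, p in model.named_parameters():    # 2. prune 20% of the remaining weights
            if n in masks:
                remaining = p[masks[n].bool()].abs()
                k = max(1, int(0.2 * remaining.numel()))
                masks[n] *= (p.abs() > remaining.kthvalue(k).values).float()
        model.load_state_dict(init_state)        # 3. reset surviving weights to theta_0
        for n, p in model.named_parameters():    # 4. re-apply the sparsity pattern
            if n in masks:
                p.mul_(masks[n])
```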

System Support for Sparsity (p.311)

  • use coarse-grained pruning at the filter level - remove entire rows from the weight matrix
    • can still use dense matrix multiplication libraries for acceleration
    • but the limitation is a lower achievable pruning ratio; what kind of hardware support is available for more fine-grained pruning?
EIE: Weight Sparsity + Activation Sparsity for GEMM (p.313)

  • all 3 can be stacked together: the speedup multiplies across all 3
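
A toy software illustration of the idea EIE exploits in hardware, skipping zero activations entirely and touching only stored non-zero weights (plain NumPy with CSC-like storage; not EIE's actual dataflow or format):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * (rng.random((8, 8)) < 0.1)   # ~90% weight sparsity
x = rng.standard_normal(8) * (rng.random(8) < 0.3)             # ~70% activation sparsity

# CSC-like storage: per column, keep only (row index, value) of the non-zero weights
columns = [[(i, W[i, j]) for i in range(W.shape[0]) if W[i, j] != 0]
           for j in range(W.shape[1])]

y = np.zeros(W.shape[0])
for j, a in enumerate(x):
    if a == 0:                       # activation sparsity: skip zero activations entirely
        continue
    for i, w in columns[j]:          # weight sparsity: only stored non-zeros are touched
        y[i] += w * a                # multiply-accumulate, as in EIE's processing elements

assert np.allclose(y, W @ x)
```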

Retrospective on EIE: a 2023 retrospective paper on the original 2016 paper

  • Pros
    • exploits both weight sparsity and activation sparsity (p.336)
    • supports fine-grained sparsity, and allows pruning to achieve a higher pruning ratio (p.336)
    • Aggressive weight quantization (4bit) to save memory footprint (p.336)
      • EIE decodes the weight to 16bit and uses 16bit arithmetic (p.336)
      • W4A16 approach is reborn in LLM: GPTQ, AWQ, llama.cpp, MLC LLM (p.336)
        • 4bit weight, 16bit activation
  • Cons
    • is not as easily applied to arrays of vector processors; the fix is structured sparsity (N:M sparsity), as in newer NVIDIA GPUs (p.336)
    • only supports FC layers - though FC layers are reborn in LLMs (p.336)
    • fits everything in SRAM - practical for TinyML, not for LLMs (p.336)
      • the TPU was designed with 28 MB of SRAM
      • because the model weights were around 28M parameters
      • commercial accelerators normally use SRAM to fit all the weights

(p.337)
  • Generative AI: spatial sparsity (SIGE, NeurIPS’22)
  • Transformer: token sparsity, progressive quantization (SpAtten, HPCA’21)
  • Video: temporal sparsity (TSM, ICCV’19)
  • Point cloud: spatial sparsity (TorchSparse, MLSys’22 & PointAcc, Micro’22)

(p.337) future AI models will be sparse at various granularity and structures

  • MoE models (e.g., GPT-4): a giant model where each forward pass only uses a portion of the model, not all of it

(p.337) Codesigned with specialized accelerators, sparse models will become more efficient and accessible.

NVIDIA Tensor Core: M:N Weight Sparsity (p.338)

2:4 weight sparsity / M:N sparsity

  • at most 2 non-zero values in every group of 4 consecutive elements
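
A small sketch of enforcing 2:4 sparsity by keeping the two largest-magnitude values in every aligned group of four (magnitude-based selection is an assumption here, though it is a common choice):

```python
import torch

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Return w with at most 2 non-zeros in every aligned group of 4 along the last dim."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group -> zero them out
    _, drop = groups.abs().topk(2, dim=1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop, 0.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(4, 8)
print(prune_2_to_4(w))   # every aligned group of 4 now has at least 2 zeros
```
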
TorchSparse & PointAcc: Activation Sparsity for Sparse Convolution (p.344)
TorchSparse: Sparse Convolution Library (p.345)
  • after a regular convolution, a feature map that was sparse at the input becomes dense at the output!
  • how can we keep the output sparse, with the same sparsity pattern as the input? (see the toy example after this list)
  • still pad some work with zeros to allow for parallelism: if everything is irregular, it is difficult to parallelize across the GPU
  • Separate computation (baseline): many kernel calls, low device utilization (p.362)
  • Dense convolution: best regularity but large computation overhead (p.363)
  • Computation with grouping: balances overhead and regularity (p.364)
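
A tiny demonstration of the dilation problem and of keeping the output sparsity pattern equal to the input's; the masking here is only a conceptual stand-in for what a real sparse convolution library like TorchSparse does:

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 5, 5)
x[0, 0, 2, 2] = 1.0                      # a single active site in the input feature map
w = torch.ones(1, 1, 3, 3)

dense_out = F.conv2d(x, w, padding=1)
print(int((dense_out != 0).sum()))       # 9 -> a regular conv dilates the sparsity

masked_out = dense_out * (x != 0)        # compute outputs only at originally active sites
print(int((masked_out != 0).sum()))      # 1 -> same sparsity pattern as the input
```
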
PointAcc: Hardware Accelerator for Sparse Convolution (p.370)
  • dedicated hardware for point cloud acceleration

Current Challenges

  • LLM pruning still struggles to deliver real speedups, partly because fine-tuning at LLM scale is challenging
  • iterative magnitude pruning is still challenging for large datasets like ImageNet ^78b8e6