EfficientDL - 2. Basics of Deep Learning


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


NOTE

  • mostly the same as standard undergrad deep learning / computer vision courses, or those by Andrew Ng
  • will focus on the parts that may not appear in all such courses, or on their implications for efficiency

(p.93) 4. Introduce popular efficiency metrics for neural networks - # Parameters, Model Size, Peak # Activations, MAC, FLOP, FLOPS, OP, OPS, Latency, Throughput

  • will focus more on this portion
  • how to analytically estimate the differences in speedups?

Basics

Convolution Layer: Receptive Field (p.120)
  • each output sees a more global view / a larger patch of the image, not just a region the size of the kernel
  • important concept for MCUNet: shrinking activation size, doing patch-based inference
  • how to enlarge the receptive field without increasing the number of layers (which makes it slow) or the kernel size (which increases the number of weights)? downsample the feature map (see the sketch below)
Downsample inside the neural network (p.120)
  • e.g. strided conv layer (p.121)
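A minimal PyTorch sketch (channel counts and input size are arbitrary, chosen only for illustration) showing how a stride-2 convolution halves the feature map, so later 3x3 kernels cover a larger region of the original image:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)  # (batch, channels, H, W)

conv    = nn.Conv2d(16, 32, kernel_size=3, padding=1)            # keeps 64x64
strided = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)  # downsamples to 32x32

print(conv(x).shape)     # torch.Size([1, 32, 64, 64])
print(strided(x).shape)  # torch.Size([1, 32, 32, 32])
# After the stride-2 layer, each position of a subsequent 3x3 conv spans
# roughly twice as many input pixels, enlarging the receptive field
# without adding layers or growing the kernel.
```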
Grouped Convolution Layer (p.122)
  • not all outputs depend on all inputs
  • e.g. split the channels in half: some output channels depend on only some of the input channels
  • to save computation, reduce the number of weights!
    • with g groups, the number of weights (and MACs) is reduced by g times (see the sketch below)
  • if the number of groups equals the number of input channels (and output channels), it becomes a Depthwise Convolution Layer
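A quick check of that weight reduction in PyTorch (the channel counts are arbitrary): with g groups the weight tensor shrinks by a factor of g, and groups equal to the channel count gives a depthwise convolution.

```python
import torch.nn as nn

def n_weights(m):
    # count only weight tensors, ignore biases
    return sum(p.numel() for p in m.parameters() if p.dim() > 1)

full    = nn.Conv2d(64, 64, kernel_size=3, padding=1)             # 64*64*3*3 = 36864
grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4)   # 36864 / 4 =  9216
depth   = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # 36864 / 64 =  576

print(n_weights(full), n_weights(grouped), n_weights(depth))
```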
Depthwise Convolution Layer (p.123)
  • foundation of MobileNet family
  • looks very efficient on paper, but not necessarily an efficient design in practice
    • reduces the number of weights drastically
    • reduces the number of FLOPs drastically
    • but the activation size increases to compensate for the reduced number of weights
      • leads to a lot of memory movement with the increased number of channels
    • so yes, parameter efficient, but that may not translate to speedups (see the sketch below)
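A small sketch (sizes invented) contrasting a standard 3x3 conv with a depthwise 3x3 + pointwise 1x1 pair, the building block behind MobileNet: the parameter count drops by roughly 8x, but the extra intermediate activation adds memory traffic.

```python
import torch
import torch.nn as nn

C, H, W = 128, 56, 56
x = torch.randn(1, C, H, W)

standard  = nn.Conv2d(C, C, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C),  # depthwise 3x3
    nn.Conv2d(C, C, kernel_size=1),                        # pointwise 1x1
)

def params(m):
    return sum(p.numel() for p in m.parameters())

print(params(standard))   # 147,584
print(params(separable))  # 17,792  (~8x fewer)

# FLOPs drop by a similar factor, but the depthwise output (1 x 128 x 56 x 56)
# is an extra full-size activation that must be written out and read back,
# so the parameter savings do not automatically become wall-clock speedups.
```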
Normalization Layer (p.125)
  • normalises activations to zero mean and unit variance
  • many ways to determine the set of elements to do the normalisation over (e.g. batch norm vs. layer norm)
  • layer norm: normalises over the feature dimensions of each individual element/sample; widely used in LLMs
  • something that can be utilised in efficient research, or other advanced topics!
    • only 2 learnable parameters per normalised dimension: the scaling factor and the bias
    • each is a 1-dimensional vector, which is quite parameter efficient
    • when fine-tuning, a cost-efficient way is to finetune just the scaling factor and bias of the batch norm / any type of normalisation layer (see the sketch below)
    • it is one of the Parameter Efficient Fine-Tuning (PEFT) techniques
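A hedged sketch of that PEFT idea: freeze everything, unfreeze only the affine parameters (scale and bias) of the normalization layers, and hand just those to the optimizer. The model and learning rate here are placeholders, not part of the original notes.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18()  # stand-in for any pretrained network

# Freeze all parameters, then unfreeze only the norm layers' scale (weight) and bias.
for p in model.parameters():
    p.requires_grad = False
for m in model.modules():
    if isinstance(m, (nn.BatchNorm2d, nn.LayerNorm, nn.GroupNorm)):
        for p in m.parameters():
            p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable params out of",
      sum(p.numel() for p in model.parameters()))

optimizer = torch.optim.SGD(trainable, lr=1e-2)  # lr is a placeholder
```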
Activation Function (p.126)
  • ReLU is very hardware friendly
    • There exist clipped ReLU variants like ReLU6 that make it easier to quantize (see the sketch below)
  • Some are very difficult to quantize and are hardware unfriendly e.g. Swish / Hard Swish
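A tiny illustration of why a clipped ReLU helps quantization: ReLU6 bounds the output to [0, 6], so the quantization range is fixed and known ahead of time (the input values are just an example).

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, 0.5, 4.0, 10.0])
print(F.relu(x))   # tensor([0.0000, 0.5000, 4.0000, 10.0000]) -> unbounded above
print(F.relu6(x))  # tensor([0.0000, 0.5000, 4.0000,  6.0000]) -> always in [0, 6]
# A fixed output range maps cleanly onto a fixed-point / int8 grid.
```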
ResNet-50 (p.132)

  • why is the 1x1 needed?
    • to shrink the number of channels, which reduces the number of parameters
    • and to reduce the computation done during the 3x3
  • a final 1x1 reprojects it back to N channels (see the sketch below)
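A minimal version of the bottleneck idea (channel counts chosen to mirror a typical ResNet-50 stage; skip connection, norm and activation are omitted for brevity):

```python
import torch.nn as nn

N, bottleneck = 256, 64
block = nn.Sequential(
    nn.Conv2d(N, bottleneck, kernel_size=1),                      # 1x1: shrink 256 -> 64
    nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),  # cheap 3x3 on 64 channels
    nn.Conv2d(bottleneck, N, kernel_size=1),                      # 1x1: project back to 256
)
# The expensive 3x3 runs on 64 channels instead of 256, cutting its weights
# and MACs by ~16x compared with a direct 256 -> 256 3x3 convolution.
```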
MobileNetV2 (p.133)

  • inverted bottleneck: a 1x1 expands the channels from N to N*6 before the depthwise 3x3, then a 1x1 projects back down to N
  • the very big expansion ratio (6x) means a large intermediate activation
  • this is the downside! (see the sketch below)
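The mirror image for MobileNetV2 (again omitting norm, activation and the skip connection): a sketch of the inverted bottleneck, where the 1x1 expansion makes the intermediate activation 6x wider than the input.

```python
import torch
import torch.nn as nn

N, t = 24, 6  # input channels and expansion ratio (t = 6 in MobileNetV2)
block = nn.Sequential(
    nn.Conv2d(N, N * t, kernel_size=1),                                # expand 24 -> 144
    nn.Conv2d(N * t, N * t, kernel_size=3, padding=1, groups=N * t),   # depthwise 3x3
    nn.Conv2d(N * t, N, kernel_size=1),                                # project back to 24
)

x = torch.randn(1, N, 56, 56)
print(block[0](x).shape)  # torch.Size([1, 144, 56, 56]) -- the big intermediate activation
```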

Efficiency Metrics

How should we measure the efficiency of neural networks? (p.134)

Latency (p.137)
  • Measures the delay for a specific task (p.137)
  • in ms
  • lower latency is better!
  • can be compute-bound or memory-bound
Throughput (p.138)
  • Measures the rate at which data is processed (p.138)
  • in videos / s, or images / s, or instances / s
  • higher throughput is better!
Latency vs. Throughput (p.139)

  • they do not directly correlate with each other; an improvement in one does not automatically translate to the other

  • batching / parallel processing across more CUDA cores improves the throughput

    • but latency does not necessarily reduce!
  • optimising for latency is generally more difficult; how?

    • overlapping the compute with the memory access
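A rough back-of-the-envelope model (every number is a made-up placeholder) of why latency is set by whichever of compute and memory finishes last, and why overlapping them hides the smaller term:

```python
# Rough roofline-style latency estimate; all numbers are placeholders.
macs            = 300e6      # MACs per inference
peak_macs_per_s = 1e12 / 2   # hardware peak: 1 TFLOPS ~= 0.5 T MAC/s
bytes_moved     = 20e6       # weights + activations read/written per inference
bandwidth       = 10e9       # 10 GB/s memory bandwidth

t_compute = macs / peak_macs_per_s    # ~0.6 ms
t_memory  = bytes_moved / bandwidth   # ~2.0 ms

latency_no_overlap   = t_compute + t_memory      # fully serialized
latency_full_overlap = max(t_compute, t_memory)  # compute hidden behind memory access

print(f"{latency_no_overlap*1e3:.2f} ms vs {latency_full_overlap*1e3:.2f} ms")
# Here memory dominates (memory-bound), so adding FLOPS barely helps latency.
```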
Energy Consumption (p.141)
  • memory referencing is very expensive in terms of energy
  • roughly 200x more expensive than doing a MUL/ADD
Number of Parameters (# Parameters) (p.145)
  • is the parameter (synapse/weight) count of the given neural network, i.e., the number of elements in the weight tensors (p.145)
Model Size (p.152)
  • measures the storage for the weights of the given neural network (p.152)
  • in MegaBytes (MB), KiloBytes (KB) etc
  • assuming all weights use the same datatype, e.g. fp32
  • Model Size = # Parameters x bit width (in bytes); see the sketch below
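A quick sketch of counting parameters and converting them to model size at a few common bit widths, using torchvision's ResNet-50 purely as a familiar example:

```python
from torchvision.models import resnet50

model = resnet50()
n_params = sum(p.numel() for p in model.parameters())
print(f"# Parameters: {n_params/1e6:.1f} M")  # ~25.6 M

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    size_mb = n_params * bytes_per_param / 2**20
    print(f"model size @ {name}: {size_mb:.1f} MB")
# fp32: ~97.5 MB, fp16: ~48.8 MB, int8: ~24.4 MB
```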
Number of Activations (# Activations) (p.154)
  • is the memory bottleneck in inference on IoT, not # Parameters (p.154)
  • total # activations is calculated by summing (number of channels x height x width) of the input and output feature maps across all layers
    • ensuring no double counting in-between layers (one layer's output is the next layer's input)
  • peak # activations is the maximum, over the layers, of the activation memory (input + output) that a single layer needs resident at once (see the sketch below)
  • # Activation didn’t improve from ResNet to MobileNet-v2 (p.155)
  • sometimes the peak activation size is the real bottleneck
  • e.g. if one layer is a lot larger than the others
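A rough way to see per-layer activation sizes (and the peak) with forward hooks. This is a simple approximation that only tracks conv layers and counts each layer's input + output elements; the model and input resolution are just examples.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()
sizes = []  # (input elements, output elements) per conv layer

def hook(module, inputs, output):
    sizes.append((inputs[0].numel(), output.numel()))

handles = [m.register_forward_hook(hook) for m in model.modules()
           if isinstance(m, nn.Conv2d)]
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
for h in handles:
    h.remove()

total = sizes[0][0] + sum(out for _, out in sizes)  # count each feature map once
peak  = max(inp + out for inp, out in sizes)        # input and output coexist in memory
print(f"total ~{total/1e6:.1f}M elements, peak ~{peak/1e6:.1f}M elements")
```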
Number of Multiply-Accumulate Operations (MAC) (p.161)
  • MAC / MV / GEMM
  • Multiply-Accumulate operation (MAC)(p.161)
  • Matrix-Vector Multiplication (MV) (p.161)
  • General Matrix-Matrix Multiplication (GEMM) (p.161)
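Handy closed-form MAC counts for the two most common layers, written as a small helper (a sketch that ignores bias terms):

```python
def linear_macs(c_in, c_out):
    # matrix-vector product: every output element needs c_in multiply-accumulates
    return c_in * c_out

def conv2d_macs(c_in, c_out, k, h_out, w_out, groups=1):
    # each output element needs (c_in / groups) * k * k MACs
    return c_out * (c_in // groups) * k * k * h_out * w_out

print(linear_macs(4096, 4096))                    # 16.8M MACs for a 4096x4096 layer
print(conv2d_macs(64, 64, 3, 56, 56))             # ~115.6M MACs
print(conv2d_macs(64, 64, 3, 56, 56, groups=64))  # depthwise: ~1.8M MACs
```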
Number of Floating Point Operations (FLOP) (p.168)
  • A multiply is a floating point operation (p.168)
  • An add is a floating point operation
  • One multiply-accumulate (MAC) operation is two floating point operations (FLOP), assuming the MAC operands are floating point
  • Floating Point Operation Per Second (FLOPS) (p.168)
  • Number of Operations (OP) (p.169)
    • Activations/weights in neural network computing are not always floating point. generalize! (p.169)
  • Operation Per Second (OPS) (p.169)
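Putting the units together (every number below is a made-up placeholder, not a benchmark): MACs/FLOPs measure the work in the model, FLOPS measures the hardware rate, and their ratio gives a rough compute-only latency estimate.

```python
macs  = 724e6     # e.g. a CNN with 724M MACs
flops = 2 * macs  # 1 MAC = 2 FLOPs (multiply + add) -> ~1.45 GFLOPs

peak_flops_per_s = 4e12  # hypothetical accelerator: 4 TFLOPS peak
utilization      = 0.30  # real workloads rarely hit peak

t_compute = flops / (peak_flops_per_s * utilization)
print(f"rough compute-only latency: {t_compute*1e3:.2f} ms")  # ~1.21 ms
# Memory traffic, which FLOPs/OPs ignore, can push the real latency well above this.
```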