EfficientDL - 2. Basics of Deep Learning
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
NOTE
- mostly the same material as typical undergrad deep learning / computer vision courses, or those by Andrew Ng
- will focus on the parts that may not be covered in all such courses, and on their implications for efficiency
(p.93) 4. Introduce popular efficiency metrics for neural networks - # Parameters, Model Size, Peak # Activations, MAC, FLOP, FLOPS, OP, OPS, Latency, Throughput
- will focus more on this portion
- how to analytically understand the differences in speedups?
Basics
Convolution Layer: Receptive Field (p.120)
- deeper layers see a global view / larger patch of the image, rather than just a kernel-sized patch
- important concept for MCUNet ⇒ shrinking activation size, doing patch-based inference
- how to enlarge receptive field without increasing number of layers (makes it slow), or without increasing kernel size (but increases number of weights)? ⇒ downsample the feature map
Downsample inside the neural network (p.120)
- e.g. strided conv layer (p.121)
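A minimal PyTorch sketch (my own, not from the slides) of downsampling with a strided conv and how it affects receptive-field growth; shapes and channel counts are arbitrary.
```python
# Sketch: a stride-2 conv halves the feature map, so every 3x3 conv stacked
# after it grows the receptive field twice as fast per layer.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)                                   # N, C, H, W

conv_s1 = nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1)  # keeps resolution
conv_s2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)  # downsamples by 2

print(conv_s1(x).shape)   # torch.Size([1, 16, 32, 32])
print(conv_s2(x).shape)   # torch.Size([1, 16, 16, 16])

# Receptive field grows as r_l = r_{l-1} + (k - 1) * jump, where jump is the
# product of all strides so far. Three stride-1 3x3 convs give r = 7; put a
# stride-2 3x3 conv first and the next two 3x3 convs give r = 3 + 4 + 4 = 11.
```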
Grouped Convolution Layer (p.122)
- not all outputs depend on all inputs
- e.g. split the channels into groups (e.g. halves) ⇒ each output channel depends only on the input channels in its group
- saves compute at the weights ⇒ reduces the number of weights!
- if the number of groups equals the number of input (and output) channels ⇒ Depthwise Convolution Layer
Depthwise Convolution Layer (p.123)
- foundation of the MobileNet family
- not a very hardware-efficient design though
- reduces the number of weights drastically
- reduces the number of FLOPs drastically
- but the activation size increases to compensate for the reduced number of weights
- leads to a lot of memory movement with the increased number of channels
- so yes, parameter efficient, but that may not translate into speedups (see the sketch below)
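A small PyTorch sketch (my own illustration, 64→64 channels chosen arbitrarily) comparing the weight counts of a standard, a grouped (2 groups), and a depthwise 3x3 conv:
```python
import torch.nn as nn

def n_weights(m):
    return sum(p.numel() for p in m.parameters())

std   = nn.Conv2d(64, 64, 3, padding=1, bias=False)             # every output sees every input
group = nn.Conv2d(64, 64, 3, padding=1, groups=2,  bias=False)  # outputs see half the inputs
depth = nn.Conv2d(64, 64, 3, padding=1, groups=64, bias=False)  # one filter per channel

print(n_weights(std))    # 64*64*3*3 = 36864
print(n_weights(group))  # 64*32*3*3 = 18432  (halved)
print(n_weights(depth))  # 64*1*3*3  = 576    (drastically reduced)
# The activation size is unchanged in all three cases, which is why the FLOP
# savings of depthwise conv do not always translate into wall-clock speedups.
```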
Normalization Layer (p.125)
- makes mean zero, variance one ⇒ zero mean unit variance
- many ways to determine the set of elements to do the normalisation for
- layer norm normalises over the features of each individual element (token) ⇒ widely used in LLMs
- something that can be utilised in efficiency research, or other advanced topics!
- only 2 learnable parameters for each dimension ⇒ the scaling factor and the bias
- each is a 1-dimensional vector, so quite parameter efficient
- when fine-tuning, a cost-efficient way is to just finetune the scaling factor and bias of the batch norm / any type of normalisation layer (see the sketch below)
- it is one of the Parameter Efficient Fine-Tuning (PEFT) techniques
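A hedged sketch of this PEFT idea: freeze all weights, then unfreeze only the affine scale/bias of the normalisation layers. The choice of torchvision's resnet18 (and its BatchNorm2d layers) is just for illustration.
```python
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)

for p in model.parameters():
    p.requires_grad = False                      # freeze the whole network

for m in model.modules():
    if isinstance(m, (nn.BatchNorm2d, nn.LayerNorm)):
        for p in m.parameters():                 # gamma (scale) and beta (bias) only
            p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")       # a tiny fraction of the model
```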
Activation Function (p.126)
- ReLU is very hardware friendly
- There exist clipped ReLUs like ReLU6 that make quantization easier
- Some are very difficult to quantize and are hardware unfriendly ⇒ e.g. Swish / Hard Swish
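A tiny sketch (my own) of the activations mentioned above: ReLU6 clips outputs to a fixed [0, 6] range, which pins down its quantization range, while Swish (SiLU) is smooth and unbounded above; Hard-Swish is a piecewise approximation of Swish built from ReLU6.
```python
import torch
import torch.nn.functional as F

x = torch.linspace(-8, 8, 9)
print(F.relu(x))        # unbounded above
print(F.relu6(x))       # clipped to [0, 6] -> fixed quantization range
print(F.silu(x))        # Swish: x * sigmoid(x)
print(F.hardswish(x))   # x * relu6(x + 3) / 6
```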
ResNet-50 (p.132)
- why the 1x1 convolutions?
- the first 1x1 shrinks the number of channels, reducing the number of parameters
- so less computation is done during the 3x3
- the final 1x1 projects the channels back to N (see the sketch below)
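A hedged PyTorch sketch of the 1x1 → 3x3 → 1x1 bottleneck pattern; N = 256 and the 4x channel reduction are illustrative, not the exact ResNet-50 configuration.
```python
import torch.nn as nn

N = 256   # input/output channels of the block

bottleneck = nn.Sequential(
    nn.Conv2d(N, N // 4, kernel_size=1, bias=False),                  # 1x1: shrink channels 256 -> 64
    nn.Conv2d(N // 4, N // 4, kernel_size=3, padding=1, bias=False),  # 3x3 runs on only 64 channels
    nn.Conv2d(N // 4, N, kernel_size=1, bias=False),                  # 1x1: project back to N
)

params = sum(p.numel() for p in bottleneck.parameters())
naive  = N * N * 3 * 3                                                # a single full-width 3x3 conv
print(params, naive)   # ~69k vs ~590k weights
```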
MobileNetV2 (p.133)
- inverted bottleneck ⇒ the channels expand from N to N*6 inside the block
- very big expansion ratio
- this is the downside! the intermediate activations become large (see the sketch below)
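A hedged sketch of a MobileNetV2-style inverted bottleneck (N = 64 and the 56x56 resolution are made up): 1x1 expand by 6x, depthwise 3x3 at the expanded width, 1x1 project back. The first print shows the 6x-wider intermediate activation, which is exactly the activation-size downside noted above.
```python
import torch
import torch.nn as nn

N, t = 64, 6                                    # input channels, expansion ratio

inverted = nn.Sequential(
    nn.Conv2d(N, N * t, kernel_size=1, bias=False),                               # expand: N -> 6N
    nn.Conv2d(N * t, N * t, kernel_size=3, padding=1, groups=N * t, bias=False),  # depthwise at 6N
    nn.Conv2d(N * t, N, kernel_size=1, bias=False),                               # project: 6N -> N
)

x = torch.randn(1, N, 56, 56)
print(inverted[0](x).shape)    # torch.Size([1, 384, 56, 56]) -- 6x more activation memory
print(inverted(x).shape)       # torch.Size([1, 64, 56, 56])
```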
Efficiency Metrics
How should we measure the efficiency of neural networks? (p.134)
Latency (p.137)
- Measures the delay for a specific task (p.137)
- in ms
- lower latency is better!
- can be compute-bound or memory-bound
Throughput (p.138)
- Measures the rate at which data is processed (p.138)
- in videos / s, or images / s, or instances / s
- higher throughput is better!
Latency vs. Throughput (p.139)
- they do not correlate with each other ⇒ improving one does not necessarily improve the other
- batching / parallel processing across more CUDA cores improves the throughput
- but latency does not necessarily reduce!
- optimising for latency is generally more difficult ⇒ how? (see the sketch below)
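A rough CPU micro-benchmark sketch (my own; absolute numbers depend entirely on hardware) showing why batching tends to raise throughput (images/s) while per-batch latency grows rather than shrinks. torchvision's resnet18 is just a convenient model.
```python
import time
import torch
from torchvision.models import resnet18

model = resnet18().eval()

def bench(batch_size, iters=10):
    x = torch.randn(batch_size, 3, 224, 224)
    with torch.no_grad():
        model(x)                              # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        dt = (time.perf_counter() - t0) / iters
    return dt * 1e3, batch_size / dt          # latency per batch (ms), images/s

for bs in (1, 8, 32):
    lat, thr = bench(bs)
    print(f"batch {bs:2d}: latency {lat:7.1f} ms/batch, throughput {thr:7.1f} img/s")
```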
Energy Consumption (p.141)
Number of Parameters (# Parameters) (p.145)
- is the parameter (synapse/weight) count of the given neural network, i.e., the number of elements in the weight tensors (p.145)
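A minimal sketch of counting # Parameters by summing the element counts of all weight/bias tensors; torchvision's AlexNet is only an example model (roughly 61M parameters).
```python
from torchvision.models import alexnet

model = alexnet()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f} M parameters")   # ~61.1 M
```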
Model Size (p.152)
- measures the storage for the weights of the given neural network (p.152)
- in MegaBytes (MB), KiloBytes (KB) etc
- Model Size = # Parameters × bit width, assuming all weights use the same datatype, e.g. fp32
- convert the bit width into bytes (divide by 8) to get the size in bytes (see the worked example below)
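A worked example of Model Size = # Parameters × bit width, using an illustrative AlexNet-scale count of ~61M weights:
```python
n_params = 61_000_000                     # ~61M weights, e.g. an AlexNet-scale model
for bits in (32, 16, 8):                  # fp32, fp16, int8
    size_mb = n_params * bits / 8 / 1e6   # bits -> bytes -> MB
    print(f"{bits:2d}-bit: {size_mb:5.0f} MB")
# 32-bit: 244 MB, 16-bit: 122 MB, 8-bit: 61 MB
```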
Number of Activations (# Activations) (p.154)
- is the memory bottleneck in inference on IoT, not # Parameters (p.154)
- activation size of a layer = number of channels × height × width of its input and output feature maps
- when summing across all layers, ensure no double counting between layers (one layer's output is the next layer's input)
- peak # activations = the maximum, over the layers, of the activations that must be resident at the same time for a specific layer
- # Activation didn’t improve from ResNet to MobileNet-v2 (p.155)
- sometimes the peak activation size is the real bottleneck
- e.g. if one layer's activations are a lot larger than the others' (see the sketch below)
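A hedged sketch of measuring activations with forward hooks: record the input + output element counts of every conv layer and take the maximum as the per-layer peak. Counting only Conv2d layers and using torchvision's mobilenet_v2 are simplifying choices of mine, not the exact accounting from the slides.
```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()
acts = []

def hook(module, inputs, output):
    n_in  = sum(t.numel() for t in inputs)
    n_out = output.numel()
    acts.append(n_in + n_out)          # elements that must be resident for this layer

handles = [m.register_forward_hook(hook)
           for m in model.modules() if isinstance(m, nn.Conv2d)]

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

for h in handles:
    h.remove()

print(f"peak activations: {max(acts):,} elements")
print(f"sum over layers : {sum(acts):,} elements")
```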
Number of Multiply-Accumulate Operations (MAC) (p.161)
- MAC / MV / GEMM
- Multiply-Accumulate operation (MAC)(p.161)
- Matrix-Vector Multiplication (MV) (p.161)
- General Matrix-Matrix Multiplication (GEMM) (p.161)
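A worked example (my own layer sizes) of counting MACs: a linear layer does in_features × out_features MACs per input vector, and a conv layer does C_out × (C_in / groups) × k_h × k_w × H_out × W_out MACs.
```python
def conv2d_macs(c_in, c_out, k, h_out, w_out, groups=1):
    # one MAC per weight element per output spatial position
    return c_out * (c_in // groups) * k * k * h_out * w_out

print(1000 * 4096)                                   # linear 4096 -> 1000: ~4.1M MACs
print(conv2d_macs(64, 128, 3, 56, 56))               # 3x3 conv, 64 -> 128 ch, 56x56 out: ~231M MACs
print(conv2d_macs(128, 128, 3, 56, 56, groups=128))  # depthwise 3x3 at 128 ch: ~3.6M MACs
```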
Number of Floating Point Operations (FLOP) (p.168)
- A multiply is a floating point operation (p.168)
- An add is a floating point operation
- One multiply-accumulate (MAC) operation is two floating point operations (FLOP), assuming the MAC operands are floating point
- Floating Point Operation Per Second (FLOPS) (p.168)
- Number of Operations (OP) (p.169)
- Activations/weights in neural network computing are not always floating point ⇒ generalize FLOP to OP! (p.169)
- Operation Per Second (OPS) (p.169)
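A worked example tying these together: 1 MAC = 2 FLOPs (one multiply + one add, assuming floating-point operands), and dividing a layer's FLOPs by a device's peak FLOPS gives a compute-bound lower bound on its latency. The 1 TFLOPS figure is a hypothetical accelerator, not a real spec.
```python
macs  = 231_211_008            # e.g. the 3x3 conv from the MAC example above
flops = 2 * macs               # 1 MAC = 2 FLOP -> ~0.46 GFLOP
peak_flops_per_s = 1e12        # hypothetical accelerator with 1 TFLOPS peak
t_ms = flops / peak_flops_per_s * 1e3
print(f"{flops / 1e9:.2f} GFLOP -> >= {t_ms:.2f} ms if purely compute-bound")
```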