EfficientDL - 2. Basics of Deep Learning


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


NOTE

  • mostly the same as standard undergrad deep learning / computer vision courses, or those by Andrew Ng
  • will focus on the parts that may not appear in all such courses, or on their implications for efficiency

(p.93) 4. Introduce popular efficiency metrics for neural networks - # Parameters, Model Size, Peak # Activations, MAC, FLOP, FLOPS, OP, OPS, Latency, Throughput

  • will focus more on this portion
  • how to analytically estimate the differences in speedups?

Basics

Convolution Layer: Receptive Field (p.120)
  • each output sees a more global view / a larger patch of the image, not just a region the size of the kernel
  • important concept for MCUNet: shrinking activation size, doing patch-based inference
  • how to enlarge the receptive field without increasing the number of layers (which makes it slow) or the kernel size (which increases the number of weights)? downsample the feature map (see the sketch below)
Downsample inside the neural network (p.120)
  • e.g. strided conv layer (p.121)
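A minimal PyTorch sketch (channel counts and input size are arbitrary, chosen only for illustration) showing how a stride-2 convolution halves the feature map, so later 3x3 kernels cover a larger region of the original image:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)  # (batch, channels, H, W)

conv    = nn.Conv2d(16, 32, kernel_size=3, padding=1)            # keeps 64x64
strided = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)  # downsamples to 32x32

print(conv(x).shape)     # torch.Size([1, 32, 64, 64])
print(strided(x).shape)  # torch.Size([1, 32, 32, 32])
# After the stride-2 layer, each position of a subsequent 3x3 conv spans
# roughly twice as many input pixels, enlarging the receptive field
# without adding layers or growing the kernel.
```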
Grouped Convolution Layer (p.122)
  • not all outputs depend on all inputs
  • e.g. split the channels in half: some output channels depend on only some of the input channels
  • to save computation, reduce the number of weights!
    • with g groups, the number of weights (and MACs) is reduced by g times (see the sketch below)
  • if the number of groups equals the number of input channels (and output channels), it becomes a Depthwise Convolution Layer
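A quick check of that weight reduction in PyTorch (the channel counts are arbitrary): with g groups the weight tensor shrinks by a factor of g, and groups equal to the channel count gives a depthwise convolution.

```python
import torch.nn as nn

def n_weights(m):
    # count only weight tensors, ignore biases
    return sum(p.numel() for p in m.parameters() if p.dim() > 1)

full    = nn.Conv2d(64, 64, kernel_size=3, padding=1)             # 64*64*3*3 = 36864
grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4)   # 36864 / 4 =  9216
depth   = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # 36864 / 64 =  576

print(n_weights(full), n_weights(grouped), n_weights(depth))
```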
Depthwise Convolution Layer (p.123)
  • foundation of MobileNet family
  • looks very efficient on paper, but not necessarily an efficient design in practice
    • reduces the number of weights drastically
    • reduces the number of FLOPs drastically
    • but the activation size increases to compensate for the reduced number of weights
      • leads to a lot of memory movement with the increased number of channels
    • so yes, parameter efficient, but that may not translate to speedups (see the sketch below)
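A small sketch (sizes invented) contrasting a standard 3x3 conv with a depthwise 3x3 + pointwise 1x1 pair, the building block behind MobileNet: the parameter count drops by roughly 8x, but the extra intermediate activation adds memory traffic.

```python
import torch
import torch.nn as nn

C, H, W = 128, 56, 56
x = torch.randn(1, C, H, W)

standard  = nn.Conv2d(C, C, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C),  # depthwise 3x3
    nn.Conv2d(C, C, kernel_size=1),                        # pointwise 1x1
)

def params(m):
    return sum(p.numel() for p in m.parameters())

print(params(standard))   # 147,584
print(params(separable))  # 17,792  (~8x fewer)

# FLOPs drop by a similar factor, but the depthwise output (1 x 128 x 56 x 56)
# is an extra full-size activation that must be written out and read back,
# so the parameter savings do not automatically become wall-clock speedups.
```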
Normalization Layer (p.125)
  • normalises activations to zero mean and unit variance
  • many ways to determine the set of elements to do the normalisation over (e.g. batch norm vs. layer norm)
  • layer norm: normalises over the feature dimensions of each individual element/sample; widely used in LLMs
  • something that can be utilised in efficient research, or other advanced topics!
    • only 2 learnable parameters per normalised dimension: the scaling factor and the bias
    • each is a 1-dimensional vector, which is quite parameter efficient
    • when fine-tuning, a cost-efficient way is to finetune just the scaling factor and bias of the batch norm / any type of normalisation layer (see the sketch below)
    • it is one of the Parameter Efficient Fine-Tuning (PEFT) techniques
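A hedged sketch of that PEFT idea: freeze everything, unfreeze only the affine parameters (scale and bias) of the normalization layers, and hand just those to the optimizer. The model and learning rate here are placeholders, not part of the original notes.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18()  # stand-in for any pretrained network

# Freeze all parameters, then unfreeze only the norm layers' scale (weight) and bias.
for p in model.parameters():
    p.requires_grad = False
for m in model.modules():
    if isinstance(m, (nn.BatchNorm2d, nn.LayerNorm, nn.GroupNorm)):
        for p in m.parameters():
            p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable params out of",
      sum(p.numel() for p in model.parameters()))

optimizer = torch.optim.SGD(trainable, lr=1e-2)  # lr is a placeholder
```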
Activation Function (p.126)
  • ReLU is very hardware friendly
    • There exist clipped ReLU variants like ReLU6 that make it easier to quantize (see the sketch below)
  • Some are very difficult to quantize and are hardware unfriendly e.g. Swish / Hard Swish
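A tiny illustration of why a clipped ReLU helps quantization: ReLU6 bounds the output to [0, 6], so the quantization range is fixed and known ahead of time (the input values are just an example).

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, 0.5, 4.0, 10.0])
print(F.relu(x))   # tensor([0.0000, 0.5000, 4.0000, 10.0000]) -> unbounded above
print(F.relu6(x))  # tensor([0.0000, 0.5000, 4.0000,  6.0000]) -> always in [0, 6]
# A fixed output range maps cleanly onto a fixed-point / int8 grid.
```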
ResNet-50 (p.132)

  • why is the 1x1 needed?
    • to shrink the number of channels, which reduces the number of parameters
    • and to reduce the computation done during the 3x3
  • a final 1x1 reprojects it back to N channels (see the sketch below)
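A minimal version of the bottleneck idea (channel counts chosen to mirror a typical ResNet-50 stage; skip connection, norm and activation are omitted for brevity):

```python
import torch.nn as nn

N, bottleneck = 256, 64
block = nn.Sequential(
    nn.Conv2d(N, bottleneck, kernel_size=1),                      # 1x1: shrink 256 -> 64
    nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),  # cheap 3x3 on 64 channels
    nn.Conv2d(bottleneck, N, kernel_size=1),                      # 1x1: project back to 256
)
# The expensive 3x3 runs on 64 channels instead of 256, cutting its weights
# and MACs by ~16x compared with a direct 256 -> 256 3x3 convolution.
```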
MobileNetV2 (p.133)

  • inverted bottleneck: a 1x1 expands the channels from N to N*6 before the depthwise 3x3, then a 1x1 projects back down to N
  • the very big expansion ratio (6x) means a large intermediate activation
  • this is the downside! (see the sketch below)
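The mirror image for MobileNetV2 (again omitting norm, activation and the skip connection): a sketch of the inverted bottleneck, where the 1x1 expansion makes the intermediate activation 6x wider than the input.

```python
import torch
import torch.nn as nn

N, t = 24, 6  # input channels and expansion ratio (t = 6 in MobileNetV2)
block = nn.Sequential(
    nn.Conv2d(N, N * t, kernel_size=1),                                # expand 24 -> 144
    nn.Conv2d(N * t, N * t, kernel_size=3, padding=1, groups=N * t),   # depthwise 3x3
    nn.Conv2d(N * t, N, kernel_size=1),                                # project back to 24
)

x = torch.randn(1, N, 56, 56)
print(block[0](x).shape)  # torch.Size([1, 144, 56, 56]) -- the big intermediate activation
```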

Efficiency Metrics

How should we measure the efficiency of neural networks? (p.134)

Latency (p.137)
  • Measures the delay for a specific task (p.137)
  • in ms
  • lower latency is better!
  • can be compute-bound or memory-bound
Throughput (p.138)
  • Measures the rate at which data is processed (p.138)
  • in videos / s, or images / s, or instances / s
  • higher throughput is better!
Latency vs. Throughput (p.139)

  • they do not directly correlate with each other; an improvement in one does not automatically translate to the other

  • batching / parallel processing across more CUDA cores improves the throughput

    • but latency does not necessarily reduce!
  • optimising for latency is generally more difficult; how?

    • overlapping the compute with the memory access
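A rough back-of-the-envelope model (every number is a made-up placeholder) of why latency is set by whichever of compute and memory finishes last, and why overlapping them hides the smaller term:

```python
# Rough roofline-style latency estimate; all numbers are placeholders.
macs            = 300e6      # MACs per inference
peak_macs_per_s = 1e12 / 2   # hardware peak: 1 TFLOPS ~= 0.5 T MAC/s
bytes_moved     = 20e6       # weights + activations read/written per inference
bandwidth       = 10e9       # 10 GB/s memory bandwidth

t_compute = macs / peak_macs_per_s    # ~0.6 ms
t_memory  = bytes_moved / bandwidth   # ~2.0 ms

latency_no_overlap   = t_compute + t_memory      # fully serialized
latency_full_overlap = max(t_compute, t_memory)  # compute hidden behind memory access

print(f"{latency_no_overlap*1e3:.2f} ms vs {latency_full_overlap*1e3:.2f} ms")
# Here memory dominates (memory-bound), so adding FLOPS barely helps latency.
```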
Energy Consumption (p.141)
  • memory referencing is very expensive in terms of energy
  • roughly 200x more expensive than doing a MUL/ADD
Number of Parameters (# Parameters) (p.145)
  • is the parameter (synapse/weight) count of the given neural network, i.e., the number of elements in the weight tensors (p.145)
Model Size (p.152)
  • measures the storage for the weights of the given neural network (p.152)
  • in MegaBytes (MB), KiloBytes (KB) etc
  • assuming all weights use the same datatype, e.g. fp32
  • Model Size = # Parameters x bit width (in bytes); see the sketch below
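A quick sketch of counting parameters and converting them to model size at a few common bit widths, using torchvision's ResNet-50 purely as a familiar example:

```python
from torchvision.models import resnet50

model = resnet50()
n_params = sum(p.numel() for p in model.parameters())
print(f"# Parameters: {n_params/1e6:.1f} M")  # ~25.6 M

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    size_mb = n_params * bytes_per_param / 2**20
    print(f"model size @ {name}: {size_mb:.1f} MB")
# fp32: ~97.5 MB, fp16: ~48.8 MB, int8: ~24.4 MB
```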
Number of Activations (# Activations) (p.154)
  • is the memory bottleneck in inference on IoT, not # Parameters (p.154)
  • total # activations is calculated by summing (number of channels x height x width) of the input and output feature maps across all layers
    • ensuring no double counting in-between layers (one layer's output is the next layer's input)
  • peak # activations is the maximum, over the layers, of the activation memory (input + output) that a single layer needs resident at once (see the sketch below)
  • # Activation didn’t improve from ResNet to MobileNet-v2 (p.155)
  • sometimes the peak activation size is the real bottleneck
  • e.g. if one layer is a lot larger than the others
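A rough way to see per-layer activation sizes (and the peak) with forward hooks. This is a simple approximation that only tracks conv layers and counts each layer's input + output elements; the model and input resolution are just examples.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()
sizes = []  # (input elements, output elements) per conv layer

def hook(module, inputs, output):
    sizes.append((inputs[0].numel(), output.numel()))

handles = [m.register_forward_hook(hook) for m in model.modules()
           if isinstance(m, nn.Conv2d)]
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
for h in handles:
    h.remove()

total = sizes[0][0] + sum(out for _, out in sizes)  # count each feature map once
peak  = max(inp + out for inp, out in sizes)        # input and output coexist in memory
print(f"total ~{total/1e6:.1f}M elements, peak ~{peak/1e6:.1f}M elements")
```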
Number of Multiply-Accumulate Operations (MAC) (p.161)
  • MAC / MV / GEMM
  • Multiply-Accumulate operation (MAC)(p.161)
  • Matrix-Vector Multiplication (MV) (p.161)
  • General Matrix-Matrix Multiplication (GEMM) (p.161)
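Handy closed-form MAC counts for the two most common layers, written as a small helper (a sketch that ignores bias terms):

```python
def linear_macs(c_in, c_out):
    # matrix-vector product: every output element needs c_in multiply-accumulates
    return c_in * c_out

def conv2d_macs(c_in, c_out, k, h_out, w_out, groups=1):
    # each output element needs (c_in / groups) * k * k MACs
    return c_out * (c_in // groups) * k * k * h_out * w_out

print(linear_macs(4096, 4096))                    # 16.8M MACs for a 4096x4096 layer
print(conv2d_macs(64, 64, 3, 56, 56))             # ~115.6M MACs
print(conv2d_macs(64, 64, 3, 56, 56, groups=64))  # depthwise: ~1.8M MACs
```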
Number of Floating Point Operations (FLOP) (p.168)
  • A multiply is a floating point operation (p.168)
  • An add is a floating point operation
  • One multiply-accumulate (MAC) operation is two floating point operations (FLOP), assuming the MAC operands are floating point
  • Floating Point Operation Per Second (FLOPS) (p.168)
  • Number of Operations (OP) (p.169)
    • Activations/weights in neural network computing are not always floating point. generalize! (p.169)
  • Operation Per Second (OPS) (p.169)
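Putting the units together (every number below is a made-up placeholder, not a benchmark): MACs/FLOPs measure the work in the model, FLOPS measures the hardware rate, and their ratio gives a rough compute-only latency estimate.

```python
macs  = 724e6     # e.g. a CNN with 724M MACs
flops = 2 * macs  # 1 MAC = 2 FLOPs (multiply + add) -> ~1.45 GFLOPs

peak_flops_per_s = 4e12  # hypothetical accelerator: 4 TFLOPS peak
utilization      = 0.30  # real workloads rarely hit peak

t_compute = flops / (peak_flops_per_s * utilization)
print(f"rough compute-only latency: {t_compute*1e3:.2f} ms")  # ~1.21 ms
# Memory traffic, which FLOPs/OPs ignore, can push the real latency well above this.
```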