EfficientDL - 1. Introduction
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
- given the best current GPUs and today's large models, the gap between model size and hardware capability just keeps widening without model compression
- how do we make deep learning more scalable?
(p.5) NVIDIA Ampere Sparse Tensor Core
- 2x more FLOPs with the sparse tensor core (2:4 structured sparsity); see the sketch below
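A minimal sketch of what 2:4 structured sparsity means at the tensor level, assuming the usual semantics (keep the 2 largest-magnitude weights in every group of 4). The packed storage format and metadata indices that the hardware actually consumes are omitted here:

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude weights in each group of 4.
    Assumes w.size is divisible by 4."""
    flat = w.reshape(-1, 4)                        # groups of 4 along the last dim
    idx = np.argsort(np.abs(flat), axis=1)[:, :2]  # 2 smallest-magnitude per group
    pruned = flat.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_to_4(w))  # exactly 2 non-zeros remain per group of 4
```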
Vision
- maintain high accuracy, but with smaller models
(p.14) Training is more expensive than inference and is hard to fit on edge hardware (limited memory)
- edge inference - no need to send data to the cloud
- edge training - continuous / life-long learning, with more privacy at lower cost
Increasing trend of incorporating prompts into segmentation tasks, e.g. SAM
(p.16) SAM runs at only 12 images/s due to its large vision transformer backbone (ViT-Huge) (p.17) EfficientViT reaches 842 images/s
Increasing usage of generative models, e.g. GANs and diffusion models
(p.19) Training Stable Diffusion costs $600,000 (256 A100s, 150k GPU hours)
(p.23) SIGE accelerates Stable Diffusion by >4x with spatial sparsity: Stable Diffusion 1855 GMACs / 369 ms vs. SIGE 514 GMACs (3.6x) / 95.0 ms (3.9x)
- but MACs are not a good metric! Latency and memory are better metrics for actual speedups
- MAC reduction does not always lead to a speedup
- MAC = Multiply-Accumulate operation; see the MAC-counting sketch below
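For intuition, a back-of-envelope MAC count for a single conv layer (the layer shapes are made-up examples, not from the slides). A MAC count says nothing about memory traffic or hardware parallelism, which is exactly why it can diverge from measured latency:

```python
def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """One multiply-accumulate per weight per output position."""
    return h_out * w_out * c_out * c_in * k * k

# e.g. a 3x3 conv, 64 -> 64 channels, on a 56x56 output feature map:
macs = conv2d_macs(56, 56, 64, 64, 3)
print(f"{macs / 1e9:.2f} GMACs")  # ~0.12 GMACs
```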
(p.24) FastComposer achieves tuning-free multi-subject image generation
- domain-specific augmentations; generates images of the subjects in different scenarios
What about 3D image generation? What about videos?
(p.28) "Video modeling is a harder task for which performance is not yet saturated at 5.6B model size"
- and that 5.6B counts only the model weights, not even the activations
What about 3D vision / 3D perception?
- self-driving cars, e.g. Waymo
(p.30) Fast-LiDARNet accelerates 3D perception with algorithm/system co-design
- LiDAR point clouds are very sparse
(p.31) BEVFusion supports efficient multi-task multi-sensor fusion
- how to fuse information from multiple sensors while running on a minimal platform, e.g. Jetson Orin (see the sketch below)
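A toy illustration of BEV-space fusion, not BEVFusion's actual implementation: assume camera and LiDAR features have already been projected to a shared bird's-eye-view grid (the hard part in practice). Fusion is then channel concatenation plus a 1x1 conv; all shapes are hypothetical:

```python
import numpy as np

C_cam, C_lidar, C_out, H, W = 80, 64, 128, 180, 180
bev_cam = np.random.randn(C_cam, H, W).astype(np.float32)     # camera features in BEV
bev_lidar = np.random.randn(C_lidar, H, W).astype(np.float32) # LiDAR features in BEV

fused = np.concatenate([bev_cam, bev_lidar], axis=0)          # (144, H, W)
w_fuse = np.random.randn(C_out, C_cam + C_lidar).astype(np.float32)
out = np.einsum('oc,chw->ohw', w_fuse, fused)                 # 1x1 conv as a matmul
print(out.shape)  # (128, 180, 180), shared features for detection, segmentation, ...
```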
Language
- code generation: want to keep code local without uploading it to the cloud
- neural machine translation: move to on-device, offline mode
(p.36) Lite Transformer reduces the model size with pruning and quantization
- how to optimise the transformer architecture?
Features of LLMs
(p.37) zero-shot/few-shot learning (p.38) comes at the cost of large model size (p.39) chain-of-thought reasoning
(p.41) also comes at the cost of large model size (p.42) the size of language models is growing exponentially
- therefore there is a need for more efficient transformers for language
(p.43) SpAtten accelerates language models by pruning redundant tokens
- sparse attention (toy token-pruning sketch below)
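A toy, software-only sketch in the spirit of SpAtten's cascade token pruning, not the actual hardware algorithm: rank tokens by the total attention they receive and keep only the top fraction:

```python
import numpy as np

def prune_tokens(x: np.ndarray, attn_probs: np.ndarray, keep_ratio: float = 0.5):
    """x: (T, D) token embeddings; attn_probs: (H, T, T) attention weights."""
    importance = attn_probs.sum(axis=(0, 1))     # total attention received per token
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])  # top-k tokens, original order
    return x[keep], keep

T, D, H = 8, 16, 4
x = np.random.randn(T, D)
attn = np.random.rand(H, T, T)
attn /= attn.sum(axis=-1, keepdims=True)         # rows sum to 1, like softmax output
x_pruned, kept = prune_tokens(x, attn)
print(kept, x_pruned.shape)                      # 4 surviving tokens, shape (4, 16)
```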
The dream is to run LLMs, the most powerful models, on laptops, cars, robots, even in space
(p.44) Deploying LLMs on the edge is useful: laptops, cars, robots, and more
- edge devices are resource-constrained, low-power, and sometimes have no Internet access
- data privacy is important: users do not want to share personal data with large companies
(p.45) SmoothQuant and AWQ enable edge deployment of LLMs through quantization; TinyChatEngine implements the compressed inference (see the quantization sketch below)
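For a feel of the memory math, here is plain round-to-nearest 4-bit group quantization: the naive baseline that SmoothQuant and AWQ improve on (AWQ additionally protects salient weight channels using activation statistics, none of which is shown here). Group size 128 is a common choice:

```python
import numpy as np

def quantize_w4(w: np.ndarray, group_size: int = 128):
    """Symmetric round-to-nearest int4 quantization with one scale per group.
    Assumes w.size is divisible by group_size."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 range: -8..7
    q = np.clip(np.round(g / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q * scale).reshape(shape)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_w4(w)
w_hat = dequantize(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
# 16 bits -> 4 bits per weight: ~4x smaller weights (plus small per-group scale overhead)
```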
Multimodal
(p.53) AWQ quantizes vision-language models to 4 bits with high quality
- e.g. vision-language-action models such as RT-1
(p.54) Runs at only 3 Hz due to the high computational cost and networking latency
e.g. AlphaGo
(p.55) Compute: 1,920 CPUs and 280 GPUs ($3,000 electricity bill per game)
e.g. AlphaFold for protein structure prediction
(p.56) Compute: 16 TPUv3s (128 TPUv3 cores) for a few weeks
- parallel processing with GPUs
- A100: first GPU generation to add structured sparsity support in hardware
(p.59) Software cost dominates the cost breakdown of advanced technology nodes
- hardware-aware software techniques are important!
For Cloud AI hardware P100 (2016) ⇒ V100 (2017) ⇒ A100 (2020) ⇒ H100 (2022)
For Edge AI Hardware
(p.61) Qualcomm Hexagon DSP (p.62) Apple Neural Engine, ML inference on Apple silicon (p.63) NVIDIA Jetson is a complete System on Module
- Nano ⇒ TX2 ⇒ Xavier NX ⇒ AGX Xavier ⇒ AGX Orin (32GB) ⇒ AGX Orin (64GB)
(p.64) Tensor Processing Unit (TPU): an application-specific integrated circuit (ASIC) developed by Google for TensorFlow workloads
(p.65) FPGA-based Accelerators: higher performance compared to a fixed-architecture AI accelerator like a GPU, due to the efficiency of custom hardware acceleration
- a lot more flexibility
(p.66) Microcontrollers (MCUs)
- very power efficient and very low cost, but very small memory
(p.70) Peak performance does not always translate to measured speedup; see the benchmarking sketch below
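A minimal benchmarking sketch for measuring wall-clock latency instead of trusting peak FLOPs. `run_inference` is a stand-in for whatever model call is being benchmarked; on a GPU you would also need to synchronize the device before reading the clock:

```python
import time
import numpy as np

def benchmark(run_inference, warmup: int = 10, iters: int = 100) -> float:
    """Median latency in ms; warm-up runs keep one-time costs
    (allocation, JIT, cold caches) out of the measurement."""
    for _ in range(warmup):
        run_inference()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()
        times.append(time.perf_counter() - t0)
    return float(np.median(times)) * 1e3

a = np.random.randn(512, 512).astype(np.float32)
print(f"{benchmark(lambda: a @ a):.3f} ms")  # measured, not theoretical, latency
```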