EfficientDL - 1. Introduction
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
- given the best current GPUs and today's large models, the gap between model size and hardware capability just keeps widening without model compression
- how do we make deep learning more scalable?
(p.5) NVIDIA Ampere Sparse Tensor Core
- 2x more FLOPs with the sparse tensor core (2:4 structured sparsity); see the sketch below
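A minimal sketch of what 2:4 structured sparsity means at the tensor level, assuming the usual semantics (keep the 2 largest-magnitude weights in every group of 4). The packed storage format and metadata indices that the hardware actually consumes are omitted here:

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude weights in each group of 4.
    Assumes w.size is divisible by 4."""
    flat = w.reshape(-1, 4)                        # groups of 4 along the last dim
    idx = np.argsort(np.abs(flat), axis=1)[:, :2]  # 2 smallest-magnitude per group
    pruned = flat.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_to_4(w))  # exactly 2 non-zeros remain per group of 4
```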
Vision
- maintain high accuracy, but with smaller models
(p.14) Training is more expensive than inference and is hard to fit on edge hardware (limited memory)
- edge inference - no need to send data to the cloud
- edge training - continuous / life-long learning, with more privacy at lower cost
Increasing trend of incorporating prompts into segmentation tasks, e.g. SAM
(p.16) SAM runs at only 12 images/s due to its large vision transformer backbone (ViT-Huge) (p.17) EfficientViT reaches 842 images/s
Increasing usage of generative models, e.g. GANs and diffusion models
(p.19) Training Stable Diffusion costs $600,000 (256 A100s, 150k GPU hours)
(p.23) SIGE accelerates Stable Diffusion by >4x with spatial sparsity: Stable Diffusion 1855 GMACs / 369 ms vs. SIGE 514 GMACs (3.6x) / 95.0 ms (3.9x)
- but MACs are not a good metric! Latency and memory are better metrics for actual speedups
- MAC reduction does not always lead to a speedup
- MAC = Multiply-Accumulate operation; see the MAC-counting sketch below
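For intuition, a back-of-envelope MAC count for a single conv layer (the layer shapes are made-up examples, not from the slides). A MAC count says nothing about memory traffic or hardware parallelism, which is exactly why it can diverge from measured latency:

```python
def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """One multiply-accumulate per weight per output position."""
    return h_out * w_out * c_out * c_in * k * k

# e.g. a 3x3 conv, 64 -> 64 channels, on a 56x56 output feature map:
macs = conv2d_macs(56, 56, 64, 64, 3)
print(f"{macs / 1e9:.2f} GMACs")  # ~0.12 GMACs
```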
(p.24) FastComposer achieves tuning-free multi-subject image generation
- domain-specific augmentations; generates images of the subjects in different scenarios
What about 3D image generation? What about videos?
(p.28) "Video modeling is a harder task for which performance is not yet saturated at 5.6B model size"
- and that 5.6B counts only the model weights, not even the activations
What about 3D vision / 3D perception?
- self-driving cars, e.g. Waymo
(p.30) Fast-LiDARNet accelerates 3D perception with algorithm/system co-design
- LiDAR point clouds are very sparse
(p.31) BEVFusion supports efficient multi-task multi-sensor fusion
- how to fuse information from multiple sensors while running on a minimal platform, e.g. Jetson Orin (see the sketch below)
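A toy illustration of BEV-space fusion, not BEVFusion's actual implementation: assume camera and LiDAR features have already been projected to a shared bird's-eye-view grid (the hard part in practice). Fusion is then channel concatenation plus a 1x1 conv; all shapes are hypothetical:

```python
import numpy as np

C_cam, C_lidar, C_out, H, W = 80, 64, 128, 180, 180
bev_cam = np.random.randn(C_cam, H, W).astype(np.float32)     # camera features in BEV
bev_lidar = np.random.randn(C_lidar, H, W).astype(np.float32) # LiDAR features in BEV

fused = np.concatenate([bev_cam, bev_lidar], axis=0)          # (144, H, W)
w_fuse = np.random.randn(C_out, C_cam + C_lidar).astype(np.float32)
out = np.einsum('oc,chw->ohw', w_fuse, fused)                 # 1x1 conv as a matmul
print(out.shape)  # (128, 180, 180), shared features for detection, segmentation, ...
```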
Language
- code generation: want to keep code local without uploading it to the cloud
- neural machine translation: move to on-device, offline mode
(p.36) Lite Transformer reduces the model size with pruning and quantization
- how to optimise the transformer architecture?
Features of LLMs
(p.37) zero-shot/few-shot learning (p.38) comes at the cost of large model size (p.39) chain-of-thought reasoning
(p.41) also comes at the cost of large model size (p.42) the size of language models is growing exponentially
- therefore there is a need for more efficient transformers for language
(p.43) SpAtten accelerates language models by pruning redundant tokens
- sparse attention (toy token-pruning sketch below)
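A toy, software-only sketch in the spirit of SpAtten's cascade token pruning, not the actual hardware algorithm: rank tokens by the total attention they receive and keep only the top fraction:

```python
import numpy as np

def prune_tokens(x: np.ndarray, attn_probs: np.ndarray, keep_ratio: float = 0.5):
    """x: (T, D) token embeddings; attn_probs: (H, T, T) attention weights."""
    importance = attn_probs.sum(axis=(0, 1))     # total attention received per token
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])  # top-k tokens, original order
    return x[keep], keep

T, D, H = 8, 16, 4
x = np.random.randn(T, D)
attn = np.random.rand(H, T, T)
attn /= attn.sum(axis=-1, keepdims=True)         # rows sum to 1, like softmax output
x_pruned, kept = prune_tokens(x, attn)
print(kept, x_pruned.shape)                      # 4 surviving tokens, shape (4, 16)
```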
The dream is to run LLMs, the most powerful models, on laptops, cars, robots, even in space
(p.44) Deploying LLMs on the edge is useful: laptops, cars, robots, and more
- edge devices are resource-constrained, low-power, and sometimes have no Internet access
- data privacy is important: users do not want to share personal data with large companies
(p.45) SmoothQuant and AWQ enable edge deployment of LLMs through quantization; TinyChatEngine implements the compressed inference (see the quantization sketch below)
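For a feel of the memory math, here is plain round-to-nearest 4-bit group quantization: the naive baseline that SmoothQuant and AWQ improve on (AWQ additionally protects salient weight channels using activation statistics, none of which is shown here). Group size 128 is a common choice:

```python
import numpy as np

def quantize_w4(w: np.ndarray, group_size: int = 128):
    """Symmetric round-to-nearest int4 quantization with one scale per group.
    Assumes w.size is divisible by group_size."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 range: -8..7
    q = np.clip(np.round(g / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q * scale).reshape(shape)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_w4(w)
w_hat = dequantize(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
# 16 bits -> 4 bits per weight: ~4x smaller weights (plus small per-group scale overhead)
```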
Multimodal
(p.53) AWQ quantizes vision-language models to 4 bits with high quality
- e.g. vision-language-action models such as RT-1
(p.54) Runs at only 3 Hz due to the high computational cost and networking latency
e.g. AlphaGo
(p.55) Compute: 1,920 CPUs and 280 GPUs ($3,000 electricity bill per game)
e.g. AlphaFold for protein structure prediction
(p.56) Compute: 16 TPUv3s (128 TPUv3 cores) for a few weeks
- parallel processing with GPUs
- A100: first GPU generation to add structured sparsity support in hardware
(p.59) Software cost dominates the cost breakdown of advanced technology nodes
- hardware-aware software techniques are important!
For Cloud AI hardware P100 (2016) ⇒ V100 (2017) ⇒ A100 (2020) ⇒ H100 (2022)
For Edge AI Hardware
(p.61) Qualcomm Hexagon DSP (p.62) Apple Neural Engine, ML inference on Apple silicon (p.63) NVIDIA Jetson is a complete System on Module
- Nano ⇒ TX2 ⇒ Xavier NX ⇒ AGX Xavier ⇒ AGX Orin (32GB) ⇒ AGX Orin (64GB)
(p.64) Tensor Processing Unit (TPU): an application-specific integrated circuit (ASIC) developed by Google for TensorFlow workloads
(p.65) FPGA-based Accelerators: higher performance compared to a fixed-architecture AI accelerator like a GPU, due to the efficiency of custom hardware acceleration
- a lot more flexibility
(p.66) Microcontrollers (MCUs)
- very power efficient and very low cost, but very small memory
(p.70) Peak performance does not always translate to measured speedup; see the benchmarking sketch below
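A minimal benchmarking sketch for measuring wall-clock latency instead of trusting peak FLOPs. `run_inference` is a stand-in for whatever model call is being benchmarked; on a GPU you would also need to synchronize the device before reading the clock:

```python
import time
import numpy as np

def benchmark(run_inference, warmup: int = 10, iters: int = 100) -> float:
    """Median latency in ms; warm-up runs keep one-time costs
    (allocation, JIT, cold caches) out of the measurement."""
    for _ in range(warmup):
        run_inference()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()
        times.append(time.perf_counter() - t0)
    return float(np.median(times)) * 1e3

a = np.random.randn(512, 512).astype(np.float32)
print(f"{benchmark(lambda: a @ a):.3f} ms")  # measured, not theoretical, latency
```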