Vision Transformers (ViT)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview

Related fields
- Attention - Prerequisite Knowledge to Transformers
- Transformers
Introduction to ViT

- Transformers lack the inductive biases (priors) of CNNs, such as translation equivariance and locality / local receptive fields (though attention does provide a form of receptive field)
- Transformers (without positional embeddings) are, however, permutation invariant (loosely analogous to translation invariance)
- Steps (see the sketch after this list):
- Split image into patches
- Flatten patches
- Produce low-dim linear embeddings from flattened patches
- Add positional embeddings
- Feed sequence as input to standard transformer encoder
- Pretrain model with image labels (fully sup on HUGE dataset)
- Finetune model on downstream dataset
- Identical encoder block to the original Transformer Encoder
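A minimal PyTorch sketch of the steps above (sizes are illustrative and nn.TransformerEncoder is used for brevity; this is not the reference implementation):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT classifier following the steps above (illustrative sizes)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768,
                 depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        patch_dim = in_ch * patch_size * patch_size
        self.patch_size = patch_size
        self.to_embedding = nn.Linear(patch_dim, dim)            # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # learnable [class] token
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D position embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
                                           activation="gelu", norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # standard encoder, NO decoder
        self.mlp_head = nn.Linear(dim, num_classes)              # classification head on the [class] token

    def forward(self, x):                                        # x: (B, C, H, W)
        p = self.patch_size
        B, C, H, W = x.shape
        # split into patches and flatten: (B, N, C*p*p)
        x = x.unfold(2, p, p).unfold(3, p, p)                    # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.to_embedding(x)                                 # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embedding      # prepend [class] token, add positions
        x = self.encoder(x)
        return self.mlp_head(x[:, 0])                            # classify from the [class] token
```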

- Types of models
- Heads = number of attention heads in Multi-Head Attention (parallel attention computations) {CORE IDEA}
- MLP size = hidden size of the feed-forward (MLP) layers
- D = embedding size
- kept fixed throughout the layers so that (short) residual skip connections can be used
- NO DECODER!
- only an MLP head for final classification
- Finetuning
- Pretrained on a large dataset, then finetuned on a smaller downstream dataset
- Done by discarding the prediction head (MLP head) and attaching a new D x K linear layer, where K = number of classes in the downstream dataset (as sketched below)
- For higher-resolution target images, 2D interpolation of the pre-trained position embeddings is needed
- Positional embeddings are trainable parameters (learned rather than fixed)
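A hedged sketch of the finetuning changes (the attribute names vit.mlp_head and vit.pos_embedding follow the sketch above and are assumptions, not a library API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prepare_for_finetuning(vit, num_classes_new, old_grid, new_grid, dim=768):
    """Swap the pre-training head and 2D-interpolate position embeddings (illustrative)."""
    # 1) discard the pre-training prediction head, attach a zero-initialised D x K layer
    vit.mlp_head = nn.Linear(dim, num_classes_new)
    nn.init.zeros_(vit.mlp_head.weight)
    nn.init.zeros_(vit.mlp_head.bias)

    # 2) 2D-interpolate the patch position embeddings for the higher target resolution
    pos = vit.pos_embedding                              # (1, 1 + old_grid*old_grid, dim)
    cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    vit.pos_embedding = nn.Parameter(torch.cat([cls_pos, patch_pos], dim=1))
    return vit
```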
Representing image as sequence of patches
- The image (H x W x C) is split into N = HW / P² patches of size P x P
- The patches are then flattened and projected to D-dimensional embeddings with a trainable linear layer (in practice often a strided convolution; see the sketch below)
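The split + flatten + project steps are commonly fused into a single strided convolution (equivalent when kernel size = stride = patch size); a minimal sketch:

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify + linear projection in one op: Conv2d with kernel = stride = patch size."""
    def __init__(self, in_ch=3, patch_size=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, N, dim) sequence of patch embeddings
```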
Positional embeddings
- The choice of position-embedding scheme does not matter much ⇒ WHY?
- Likely due to encoder operating at patch-level
- Relationships between patches (i.e. spatial information) not crucial
- why is it not crucial?
- because the patches + embeddings go into the transformer encoder, which then has a global receptive field?
- ViT uses trainable position embeddings (rather than the fixed sinusoidal structure of the original Transformer); the paper found little difference between embedding schemes
Why does it work? - Mean attention distance vs Receptive Field

- mean attention distance - the average image distance between a query patch and the patches it attends to, weighted by the attention weights (see the sketch after this list)
- as the depth increases, the heads all generally get closer to global computation / attention / receptive field
- even at low network depth, some heads already have close to global computation / attention / receptive field (yellow shaded portion) ⇒ CNNs won't have this! CNNs build their receptive field over depth: it starts small and grows with depth
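A hedged sketch of how mean attention distance can be computed from one layer's attention weights (assumes attn has shape (B, heads, N, N) over N patch tokens on a grid, with the [class] token excluded):

```python
import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """Attention-weighted average pixel distance between query and key patches, per head."""
    # 2D patch-centre coordinates in pixels, shape (N, 2)
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_size
    # pairwise Euclidean distances between patch centres, shape (N, N)
    dist = torch.cdist(coords, coords)
    # weight distances by attention probabilities, average over queries and batch
    return (attn * dist).sum(dim=-1).mean(dim=(0, 2))   # (heads,) mean distance per head
```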
Benefits vs CNN
- Early in the network, the ViT can already have global attention, vs CNNs where the receptive field only gets larger with more layers
- means that ViTs can gain information from the whole image early, while CNNs only aggregate information locally within the conv filter
- The CNN inductive bias biases the estimator towards a certain way of learning ⇒ i.e. via local receptive fields
- With more data, a ViT can learn the true representation!
- With more data ⇒ an unbiased model (ViT) will perform better than a biased model (CNN)
- but the transformer also has some strong inductive bias → e.g. the skip connections
- Scaling
- Compute
- Data
- Sure, if you have “small” amounts of data or are limited in your compute, then CNNs are the better option since they have an inductive bias specifically designed for that. But if you think back to basic Machine Learning theory: a simpler model will perform better if you have limited data, but the more data you have, the more complex your model should become. And Transformers can learn/express much more complex functions than CNNs.
- If you have a ton of data, then CNN’s inductive bias actually becomes a hindrance. It’s too strict of a corset to truly nail complex scenes: Sometimes it just is bloody helpful to be able to quickly exchange information between far away pixels: whenever you need more than one piece of information to truly understand what’s going on in an image, and those pieces of information are scattered within the scene
- Needing large datasets as the benchmark - it is only once you get to datasets orders of magnitude bigger than ImageNet that ViT truly shines.
Limitations vs CNN
- Need massive amounts of data, and hence massive compute requirements
- Massive amounts = need datasets of >14M images to beat SOTA CNNs (contentious in amounts actually needed)
- Otherwise just stick to resnet / efficientnet
- why need so much? is it because encoder only? ✅ 2023-12-22
- possible answer is because the model needs to learn the inductive bias towards images
- CNN has good Inductive Bias or Inductive Prior for images, which is that:
- probably what 1 pixel cares about is its immediate neighbourhood, and what that neighbourhood cares about is its own immediate neighbourhood ⇒ and this is exactly what the CNN models!
- hence with less data you can learn better, since this bias is likely to help the model!
- With less data ⇒ a biased model (CNN) will perform better than an unbiased model (ViT)
- But, a bias is not a perfect match for the true representation
- [D] Why Vision Tranformers? : r/MachineLearning
ViT Variants (incl OD, SemSeg)
- Quite a few try to introduce inductive biases / priors from CNNs
Multiple directions on improving/building upon ViT:
- Looking for new “self-attention” blocks (XCIT)
- Looking for new combinations of existing blocks and ideas from NLP (PVT, SWIN)
- Adapting ViT architecture to a new domain/task (e.g. SegFormer, UNETR)
- Forming architectures based on CNN design choices (MViT)
- Studying how to scale ViTs up and down for optimal transfer learning performance.
- Searching for suitable pretext task for deep unsupervised/self-supervised learning (DINO)
Swin Transformers
- local (shifted) windows for performing self-attention (see the sketch below)
- hierarchical transformer that reintroduces ConvNet priors
- Swin Transformer paper animated and explained - YouTube
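A minimal sketch of the (shifted) window partitioning idea, assuming a (B, H, W, C) feature map with H and W divisible by the window size; not the official implementation:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows for local attention."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    # (num_windows*B, window_size*window_size, C): each window is an independent attention sequence
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shift_windows(x, window_size):
    """Shifted windows (alternating blocks): cyclically shift the feature map before partitioning."""
    s = window_size // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))
```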
Data-efficient Image Transformer (DeiT)
- Review: Data Efficient Image Transformer (DeiT) | by Sik-Ho Tsang | Medium
- DeiT was proposed because ViT does not generalize well when trained on insufficient amounts of data
- Mostly same architecture as ViT, but trained on ImageNet only, no external data
- Includes a distillation token for a teacher-student strategy (loss sketched after this list); see Knowledge Distillation
- For the distillation token, using a convnet teacher gives better performance than using a transformer.
- “According to DeiT, various techniques are required to effectively train ViTs. Thus, we applied data augmentations such as CutMix, Mixup, Auto Augment, Repeated Augment to all models.” Data Augmentation (Images)
- DeiT - Data-efficient image transformers & distillation through attention (paper illustrated) - YouTube
- GitHub - facebookresearch/deit: Official DeiT repository
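A hedged sketch of DeiT-style hard-label distillation: the class token is supervised by the ground-truth label and the distillation token by the teacher's hard prediction (the names cls_logits / dist_logits are assumptions):

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of CE on true labels (class token) and CE on the teacher's argmax (distillation token)."""
    loss_cls = F.cross_entropy(cls_logits, labels)              # supervision from ground truth
    teacher_labels = teacher_logits.argmax(dim=-1)              # hard pseudo-labels from the teacher
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)    # supervision from the teacher
    return 0.5 * loss_cls + 0.5 * loss_dist
```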
Masked Autoencoder (MAE)

- Method for self-supervised pre-training of Vision Transformers
- Shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after fine-tuning that outperform supervised pre-training.
- i.e. by masking a large portion (75%) of the image patches, the model must reconstruct raw pixel values
- A form of more general denoising autoencoders
- The autoencoder is asymmetric: a lightweight decoder is used only for patch reconstruction
- The MAE encoder (a ViT) only encodes the visible, unmasked patches (masking sketched after this list); the encoded patches are then concatenated with mask tokens before the decoder.
- After pre-training, one “throws away” the decoder used to reconstruct pixels, and one uses the encoder for fine-tuning/linear probing.
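A hedged sketch of MAE-style random masking (keep ~25% of patch tokens using the per-sample noise-argsort trick; shapes assumed):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample; return kept tokens and restore indices."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)        # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)                    # ascending: smallest noise = kept
    ids_restore = ids_shuffle.argsort(dim=1)              # to undo the shuffle before the decoder
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_restore                              # the encoder sees only `kept`
```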
Token Clustering Transformer (TCFormer)
- Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

- addresses tokenisation: instead of splitting the image into a regular grid of tokens, tokens are clustered so they adapt to semantic regions and are fine-grained in important areas
DINO
- Self supervised Learning
- Multi-crop idea where teacher sees only global views while the student has access to both global and local views of the transformed input image
- the multi-crop strategy is reportedly less beneficial for CNNs than for ViTs
- attention maps illustrate that the model automatically learns class-specific features leading to unsupervised object segmentation
MLP-Mixer

- [2105.01601] MLP-Mixer: An all-MLP Architecture for Vision
- architecture based exclusively on multi-layer perceptrons (MLPs)
- 2 types of layers (sketched below):
- one with MLPs applied independently to image patches (i.e. “mixing” the per-location features)
- one with MLPs applied across patches (i.e. “mixing” spatial information).
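A minimal sketch of one Mixer layer with the two MLP types above (hidden sizes are illustrative):

```python
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer layer: token-mixing MLP (across patches) + channel-mixing MLP (per patch)."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(                    # operates on the patch axis
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(                  # operates on the channel axis
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                                  # x: (B, N, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)  # mix spatial information
        x = x + self.channel_mlp(self.norm2(x))                                # mix per-location features
        return x
```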
ConvMixer

- uses convolutions for both spatial and channel mixing (see the sketch below)
- depthwise convolutions are responsible for mixing spatial locations
- while pointwise convolutions (1x1xchannels kernels) mix channel information
- operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network.
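A minimal sketch of one ConvMixer block (illustrative sizes; the residual connection is around the depthwise conv only, as in the paper):

```python
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """Depthwise conv mixes spatial locations; pointwise (1x1) conv mixes channels."""
    def __init__(self, dim=256, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),  # spatial mixing, per channel
            nn.GELU(), nn.BatchNorm2d(dim))
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),                            # channel mixing
            nn.GELU(), nn.BatchNorm2d(dim))

    def forward(self, x):                   # x: (B, dim, H/p, W/p) patch embeddings
        x = x + self.depthwise(x)           # residual around the depthwise conv
        return self.pointwise(x)
```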
EfficientViT
- 2 papers titled EfficientViT
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
- ICCV 2023
- Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han
- Code
- Main Idea - “Multi-Scale Linear Attention module” for high-resolution dense prediction tasks
- ReLU linear attention instead of softmax attention, enhanced with convolution (see the sketch after this list)
- no hardware-inefficient operations
- Specific methods
- aggregate nearby tokens with small-kernel convolutions ⇒ generates multi-scale tokens
- ReLU linear attention performed on these multi-scale tokens ⇒ combines global receptive field with multi-scale learning
- insert depth-wise convolutions into FFN layers to improve local feature extraction capacity
- used for semseg, SAM
- Implemented in NVIDIA Jetson Generative AI Lab
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
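A hedged sketch of single-head ReLU linear attention (the core of the multi-scale linear attention paper): softmax(QKᵀ)V is replaced by the kernelised form ReLU(Q)(ReLU(K)ᵀV), normalised by ReLU(Q)(ReLU(K)ᵀ1), which is linear in sequence length; the multi-scale token aggregation and convolutions are omitted:

```python
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with a ReLU kernel: O(N) in sequence length instead of O(N^2)."""
    q, k = F.relu(q), F.relu(k)                      # non-negative feature maps replace softmax
    kv = torch.einsum("bnd,bne->bde", k, v)          # ReLU(K)^T V, computed once over all tokens
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # normaliser: ReLU(Q) (ReLU(K)^T 1)
    out = torch.einsum("bnd,bde->bne", q, kv)        # ReLU(Q) (ReLU(K)^T V)
    return out / (z.unsqueeze(-1) + eps)
```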
MobileViT
ViT Variants (OD, SemSeg)
Detection Transformer (DETR) (OD)
- End-to-End Object Detection with Transformers
- DETR ⇒ DINO (the DETR-based detector, not the self-supervised DINO above) ⇒ Grounding DINO / Mask DINO / Segment Anything ⇒ Grounding SAM, SEEM
OneFormer (Panoptic Seg, Instance Seg, Sem Seg)
Segformer
- Proposed by NVIDIA
- For semantic segmentation
CNNs since ViT (or its variants)
ConvNeXt
Questions
- what is (multi-scale) linear attention? (as per EfficientViT) ✅ 2023-12-28
- see EfficientViT
- what about DETR? ✅ 2023-12-21
Theoretical References
Papers
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Blogpost]
- A Survey of Visual Transformer
Articles
- AI Summer
Courses
Code References
Methods
Tools, Frameworks
- GitHub - lucidrains/vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
- Ross Wightman (Huggingface) - PyTorch Image Models (TIMM)

