Vision Transformers (ViT)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction to ViT

  • Transformers lack the Inductive Bias or Inductive Prior of CNNs, such as translation invariance and a local receptive field (though attention does provide a form of receptive field)
    • But transformers are permutation equivariant, a loose analogue of translation invariance
  • Steps (a minimal code sketch follows at the end of this section):
    1. Split the image into patches
    2. Flatten the patches
    3. Produce low-dimensional linear embeddings from the flattened patches
    4. Add positional embeddings
    5. Feed the sequence as input to a standard Transformer encoder
    6. Pretrain the model with image labels (fully supervised, on a huge dataset)
    7. Finetune the model on the downstream dataset
  • Identical encoder block to the original Transformer Encoder
  • Types of models: ViT-Base, ViT-Large, ViT-Huge (named e.g. ViT-L/16, where 16 is the patch size)
  • NO DECODER!
    • only an MLP head for final classification
  • Finetuning
    • Pretrained on a large dataset, then finetuned on a smaller dataset
    • Done by discarding the prediction head (MLP head) and attaching a new D x K linear layer, where K = number of classes in the downstream dataset
    • For higher-resolution target images, 2D interpolation of the pre-trained position embeddings is needed (see the sketch in the Positional embeddings section)
      • This works because the positional embeddings are trainable parameters (one learnable vector per position), not fixed encodings
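
A minimal PyTorch sketch of the pipeline above (patchify, linear embedding, [class] token, learnable positional embeddings, standard encoder, MLP head). The sizes (patch 16, D = 768) roughly follow ViT-Base, but the class, layer depths and names here are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative ViT classifier: patchify -> linear embed -> encoder -> MLP head."""
    def __init__(self, img_size=224, patch=16, dim=768, depth=4, heads=8, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch) ** 2                              # N = HW / P^2
        # strided conv == linear projection of each flattened P x P x 3 patch
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim)) # learnable 1D position embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)       # standard encoder, no decoder
        self.head = nn.Linear(dim, num_classes)                             # classification head

    def forward(self, x):                                                   # x: (B, 3, H, W)
        tokens = self.patchify(x).flatten(2).transpose(1, 2)                # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                                      # classify from the [class] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))                             # -> shape (2, 1000)
```

For finetuning, the `head` above would be discarded and replaced with a freshly initialised `nn.Linear(dim, K)` for the K classes of the downstream dataset.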

Representing image as sequence of patches

  • The image is split into patches: an image of size H x W x C is reshaped into N = HW/P² patches, each flattened into a vector of length P²·C
  • Each flattened patch is then converted into a D-dimensional embedding by a trainable linear projection (P²·C → D); a reshape sketch follows
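
A small sketch of that reshape with plain tensor ops, assuming H = W = 224 and P = 16 purely as example numbers:

```python
import torch

B, C, H, W, P = 2, 3, 224, 224, 16
x = torch.randn(B, C, H, W)

# (B, C, H, W) -> (B, N, P*P*C) with N = H*W / P^2 = 196 flattened patches
patches = x.unfold(2, P, P).unfold(3, P, P)              # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

D = 768
embed = torch.nn.Linear(C * P * P, D)                    # trainable linear projection to D dims
tokens = embed(patches)                                  # (B, 196, 768)
print(patches.shape, tokens.shape)
```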

Positional embeddings

  • The specific positional-embedding scheme does not matter much (1D, 2D, and relative variants perform similarly); why?
    • Likely because the encoder operates at patch level
    • Exact spatial relationships between patches are not crucial
      • why not?
        • because the patches + embeddings go into the transformer encoder, which has a global receptive field and can recover whatever spatial relationships it needs
  • ViT adds trainable (learned) position embeddings rather than the fixed sinusoidal encodings of the original Transformer; a sketch of interpolating them for higher resolutions follows
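
A sketch of the 2D interpolation used when finetuning at a higher resolution; the helper name, grid sizes, and bicubic mode are assumptions for illustration, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Interpolate learnable patch position embeddings (excluding the [class] slot)."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]          # (1, 1, D), (1, old*old, D)
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)  # 2D interpolation on the patch grid
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe = torch.randn(1, 14 * 14 + 1, 768)
print(resize_pos_embed(pe).shape)                                  # torch.Size([1, 577, 768])
```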

Why does it work? - Mean attention distance vs Receptive Field

  • mean attention distance: the average image-space distance between a query patch and the patches it attends to, weighted by the attention weights (the ViT analogue of receptive field size); a sketch of the computation follows
    • as depth increases, the heads generally move towards global computation / attention / receptive field
    • even at low network depth, some heads already have close-to-global computation / attention / receptive field (the yellow-shaded portion of the plot); CNNs won't have this: they build their receptive field over depth, starting small and growing with depth
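
A rough sketch of how mean attention distance could be computed for a single head, assuming a 14 x 14 patch grid and measuring distance in patch units (a hypothetical helper, not the paper's code):

```python
import torch

def mean_attention_distance(attn, grid=14):
    """attn: (N, N) attention weights of one head over N = grid*grid patch tokens
    ([class] token dropped). Returns the attention-weighted mean patch distance."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2) patch centres
    dist = torch.cdist(coords, coords)                                   # (N, N) pairwise distances
    per_query = (attn * dist).sum(dim=-1)                                # weighted distance per query patch
    return per_query.mean()

attn = torch.softmax(torch.randn(14 * 14, 14 * 14), dim=-1)              # dummy attention matrix
print(mean_attention_distance(attn))
```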

Benefits vs CNN

  • Early in the network, the ViT can already have global attention, vs CNNs where the receptive field only gets larger with more layers
    • means that ViTs can gain information from the whole image early on, while early CNN layers are local to the extent of the conv filter
  • The CNN inductive bias steers the estimator towards a particular way of learning, i.e. via a local receptive field
    • With more data, a ViT can learn the true representation!
    • With more data, an unbiased model (ViT) will perform better than a biased model (CNN)
    • but the transformer also has some strong inductive biases, e.g. the skip connections
  • Scaling
    • Compute
    • Data
      • Sure, if you have “small” amounts of data or are limited in your compute, then CNNs are the better option since they have an inductive bias specifically designed for that. But if you think back to basic Machine Learning theory: a simpler model will perform better if you have limited data, but the more data you have, the more complex your model should become. And Transformers can learn/express much more complex functions than CNNs.
      • If you have a ton of data, then CNN’s inductive bias actually becomes a hindrance. It’s too strict of a corset to truly nail complex scenes: Sometimes it just is bloody helpful to be able to quickly exchange information between far away pixels: whenever you need more than one piece of information to truly understand what’s going on in an image, and those pieces of information are scattered within the scene
      • Large datasets are the caveat: ViT only truly shines once the pretraining data is orders of magnitude bigger than ImageNet.

Limitations vs CNN

  • Need massive amounts of data, and hence massive compute requirements
    • Massive amounts = datasets of >14M images are needed to beat SOTA CNNs (the exact amount needed is contentious)
    • Otherwise just stick to resnet / efficientnet
    • why need so much? is it because encoder only? ✅ 2023-12-22
      • possible answer: the model has to learn from data the image-specific inductive biases that CNNs get for free
  • CNN has good Inductive Bias or Inductive Prior for images, which is that:
    • probably what one pixel cares about is its immediate neighbourhood, and what that neighbourhood cares about is its own immediate neighbourhood; this is exactly what the CNN models!
    • hence with less data you can learn better, since this bias is likely to help the model
    • With less data, a biased model (CNN) will perform better than an unbiased model (ViT)
    • But a bias is never a perfect match for the true representation
  • [D] Why Vision Tranformers? : r/MachineLearning

ViT Variants (incl OD, SemSeg)

  • Quite a few of them try to introduce inductive biases / priors from CNNs

Multiple directions on improving/building upon ViT:

  • Looking for new “self-attention” blocks (XCIT)
  • Looking for new combinations of existing blocks and ideas from NLP (PVT, SWIN)
  • Adapting ViT architecture to a new domain/task (i.e. SegFormer, UNETR)
  • Forming architectures based on CNN design choices (MViT)
  • Studying how to scale ViTs up and down for optimal transfer-learning performance
  • Searching for suitable pretext task for deep unsupervised/self-supervised learning (DINO)

Swin Transformers

Data-efficient Image Transformer (DeiT)

Masked Autoencoder (MAE)

  • Method for self-supervised pre-training of Vision Transformers
  • Shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after fine-tuning that outperform supervised pre-training.
    • i.e. by masking a large portion (75%) of the image patches, the model must reconstruct raw pixel values
    • A form of the more general class of denoising autoencoders
    • An asymmetric autoencoder design: a lightweight decoder is used only for patch reconstruction
    • The MAE encoder (a ViT) encodes only the visible, unmasked patches; the encoded patches are then concatenated with mask tokens before being passed to the decoder (see the masking sketch below)
  • After pre-training, one “throws away” the decoder used to reconstruct pixels, and one uses the encoder for fine-tuning/linear probing.
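
A sketch of MAE-style random masking: 75% of the patch tokens are dropped and only the visible ones go to the encoder; the shapes and helper name are illustrative:

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings. Returns the visible tokens plus the indices
    needed to scatter encoder outputs and mask tokens back into the full sequence."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # random score per patch
    ids_shuffle = noise.argsort(dim=1)              # lowest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)        # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore

tokens = torch.randn(2, 196, 768)
visible, ids_restore = random_masking(tokens)
print(visible.shape)                                # (2, 49, 768) -> encoder input
# After encoding, mask tokens are appended and un-shuffled with ids_restore
# so the lightweight decoder can reconstruct pixel values for the masked patches.
```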

Token Clustering Transformer (TCFormer)

DINO

  • Self-supervised learning via self-distillation: a student ViT is trained to match the output of a momentum (EMA) teacher; a loss sketch follows
  • Multi-crop idea: the teacher sees only global views, while the student gets both global and local views of the transformed input image
    • the approach is not as beneficial for CNNs as it is for ViTs
  • attention maps illustrate that the model automatically learns class-specific features leading to unsupervised object segmentation
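
A sketch of the core DINO self-distillation loss (teacher output centered and sharpened, then cross-entropy against the student); the temperatures and the EMA center update below are illustrative values, not the exact training recipe:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """student_out, teacher_out: (B, K) projection-head outputs for two views.
    The teacher is centered and sharpened; gradients flow only through the student."""
    teacher_probs = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    student_logp = F.log_softmax(student_out / t_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

B, K = 8, 256
center = torch.zeros(K)
student_out, teacher_out = torch.randn(B, K), torch.randn(B, K)
loss = dino_loss(student_out, teacher_out, center)
center = 0.9 * center + 0.1 * teacher_out.mean(dim=0)    # EMA update of the center
print(loss)
```

In the full method the teacher's weights are an exponential moving average of the student's, and the loss is summed over all (teacher global view, student view) pairs.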

MLP-Mixer

  • [2105.01601] MLP-Mixer: An all-MLP Architecture for Vision
  • architecture based exclusively on multi-layer perceptrons (MLPs)
  • 2 types of layers:
    • one with MLPs applied independently to image patches (i.e. “mixing” the per-location features)
    • one with MLPs applied across patches (i.e. “mixing” spatial information).
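
A sketch of one Mixer block showing the two layer types (token-mixing MLP across patches, channel-mixing MLP per patch); the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing MLP (across patches) + channel-mixing MLP (per patch)."""
    def __init__(self, num_patches=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(   # acts on the patch axis ("mixing" spatial information)
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.channel_mlp = nn.Sequential( # acts on the channel axis ("mixing" per-location features)
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                 # x: (B, num_patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

out = MixerBlock()(torch.randn(2, 196, 512))   # -> (2, 196, 512)
```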

ConvMixer

  • uses convolutions for both kinds of mixing
    • depthwise convolutions are responsible for mixing spatial locations
    • while pointwise convolutions (1 x 1 x channels kernels) mix the channel dimension
  • operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network.
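
A sketch of a ConvMixer stem plus one mixing block (depthwise conv for spatial mixing, 1 x 1 pointwise conv for channel mixing); hyperparameters are illustrative, and the residual placement is simplified relative to the paper (which wraps only the depthwise conv):

```python
import torch
import torch.nn as nn

dim, kernel, patch = 256, 9, 7

stem = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # operate directly on patches

block = nn.Sequential(
    # depthwise conv: mixes spatial locations within each channel
    nn.Conv2d(dim, dim, kernel_size=kernel, groups=dim, padding="same"),
    nn.GELU(),
    nn.BatchNorm2d(dim),
    # pointwise 1x1 conv: mixes channels at each location
    nn.Conv2d(dim, dim, kernel_size=1),
    nn.GELU(),
    nn.BatchNorm2d(dim),
)

x = stem(torch.randn(2, 3, 224, 224))   # (2, 256, 32, 32); size and resolution stay fixed from here on
x = x + block(x)                        # simplified residual connection
print(x.shape)
```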

EfficientViT

  • 2 papers titled EfficientViT
    • EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
      • ICCV 2023
      • Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han
      • Code
      • Main Idea - “Multi-Scale Linear Attention module” for high-resolution dense prediction tasks
        • ReLU linear attention instead of softmax attention, enhanced with convolution (a generic linear-attention sketch follows this list)
        • no hardware-inefficient operations
      • Specific methods
        • aggregating nearby tokens with small-kernel convolutions generates multi-scale tokens
        • ReLU linear attention performed on these multi-scale tokens combines global receptive field with multi-scale learning
        • insert depth-wise convolutions into FFN layers to improve local feature extraction capacity
      • used for semantic segmentation and SAM-style segmentation
      • Implemented in NVIDIA Jetson Generative AI Lab
    • EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
      • CVPR 2023
      • Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
      • Code, Blog
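
A rough sketch of ReLU linear attention, the softmax-free attention the first EfficientViT paper builds its multi-scale module on: similarity is relu(q)·relu(k)ᵀ, and associativity lets keys and values be aggregated first, making the cost linear in the number of tokens. This is a generic single-head sketch, not the paper's multi-scale module:

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, N, d). Computes relu(q) @ (relu(k)^T v) with row normalisation,
    i.e. O(N * d^2) cost instead of the O(N^2 * d) of softmax attention."""
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)              # aggregate keys/values first: (B, d, d)
    num = torch.einsum("bnd,bde->bne", q, kv)            # (B, N, d)
    den = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)).unsqueeze(-1) + eps
    return num / den

B, N, d = 2, 1024, 64
out = relu_linear_attention(torch.randn(B, N, d), torch.randn(B, N, d), torch.randn(B, N, d))
print(out.shape)                                         # (2, 1024, 64)
```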

MobileViT

ViT Variants (OD, SemSeg)

Detection Transformer (DETR) (OD)

  • End-to-End Object Detection with Transformers
  • Related line of work: DETR, DINO (the detection DINO), Grounding DINO, Mask DINO, Segment Anything (SAM), Grounding SAM, SEEM

OneFormer (Panoptic Seg, Instance Seg, Sem Seg)

SegFormer

  • proposed by NVIDIA
  • for semantic segmentation

CNNs since ViT (or its variants)

ConvNeXt

Questions

  • what is (multi-scale) linear attention? (as per EfficientViT) ✅ 2023-12-28
  • what about DETR? ✅ 2023-12-21

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks