Vision Transformers (ViT)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview

Related fields
- Attention - Prerequisite Knowledge to Transformers
- Transformers
Introduction to ViT

- Transformers lack the inductive biases (priors) of CNNs, such as translation equivariance and locality / local receptive fields (though attention does provide a form of receptive field)
- Transformers (without positional embeddings) are, however, permutation invariant (loosely analogous to translation invariance)
- Steps (see the sketch after this list):
- Split image into patches
- Flatten patches
- Produce low-dim linear embeddings from flattened patches
- Add positional embeddings
- Feed sequence as input to standard transformer encoder
- Pretrain model with image labels (fully sup on HUGE dataset)
- Finetune model on downstream dataset
- Identical encoder block to the original Transformer Encoder
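A minimal PyTorch sketch of the steps above (sizes are illustrative and nn.TransformerEncoder is used for brevity; this is not the reference implementation):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT classifier following the steps above (illustrative sizes)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768,
                 depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        patch_dim = in_ch * patch_size * patch_size
        self.patch_size = patch_size
        self.to_embedding = nn.Linear(patch_dim, dim)            # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # learnable [class] token
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D position embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
                                           activation="gelu", norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # standard encoder, NO decoder
        self.mlp_head = nn.Linear(dim, num_classes)              # classification head on the [class] token

    def forward(self, x):                                        # x: (B, C, H, W)
        p = self.patch_size
        B, C, H, W = x.shape
        # split into patches and flatten: (B, N, C*p*p)
        x = x.unfold(2, p, p).unfold(3, p, p)                    # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.to_embedding(x)                                 # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embedding      # prepend [class] token, add positions
        x = self.encoder(x)
        return self.mlp_head(x[:, 0])                            # classify from the [class] token
```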

- Types of models
- Heads = number of attention heads in Multi-Head Attention (parallel attention computations) {CORE IDEA}
- MLP size = hidden size of the feed-forward (MLP) layers
- D = embedding size
- kept fixed throughout the layers so that (short) residual skip connections can be used
- NO DECODER!
- only an MLP head for final classification
- Finetuning
- Pretrained on a large dataset, then finetuned on a smaller downstream dataset
- Done by discarding the prediction head (MLP head) and attaching a new D x K linear layer, where K = number of classes in the downstream dataset (as sketched below)
- For higher-resolution target images, 2D interpolation of the pre-trained position embeddings is needed
- Positional embeddings are trainable parameters (learned rather than fixed)
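A hedged sketch of the finetuning changes (the attribute names vit.mlp_head and vit.pos_embedding follow the sketch above and are assumptions, not a library API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prepare_for_finetuning(vit, num_classes_new, old_grid, new_grid, dim=768):
    """Swap the pre-training head and 2D-interpolate position embeddings (illustrative)."""
    # 1) discard the pre-training prediction head, attach a zero-initialised D x K layer
    vit.mlp_head = nn.Linear(dim, num_classes_new)
    nn.init.zeros_(vit.mlp_head.weight)
    nn.init.zeros_(vit.mlp_head.bias)

    # 2) 2D-interpolate the patch position embeddings for the higher target resolution
    pos = vit.pos_embedding                              # (1, 1 + old_grid*old_grid, dim)
    cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    vit.pos_embedding = nn.Parameter(torch.cat([cls_pos, patch_pos], dim=1))
    return vit
```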
Representing image as sequence of patches
- The image (H x W x C) is split into N = HW / P² patches of size P x P
- The patches are then flattened and projected to D-dimensional embeddings with a trainable linear layer (in practice often a strided convolution; see the sketch below)
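The split + flatten + project steps are commonly fused into a single strided convolution (equivalent when kernel size = stride = patch size); a minimal sketch:

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify + linear projection in one op: Conv2d with kernel = stride = patch size."""
    def __init__(self, in_ch=3, patch_size=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, N, dim) sequence of patch embeddings
```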
Positional embeddings
- The choice of position-embedding scheme does not matter much ⇒ WHY?
- Likely due to encoder operating at patch-level
- Relationships between patches (i.e. spatial information) not crucial
- why is it not crucial?
- because the patches + embeddings go into the transformer encoder, which then has a global receptive field?
- ViT uses trainable position embeddings (rather than the fixed sinusoidal structure of the original Transformer); the paper found little difference between embedding schemes
Why does it work? - Mean attention distance vs Receptive Field

- mean attention distance - the average image distance between a query patch and the patches it attends to, weighted by the attention weights (see the sketch after this list)
- as the depth increases, the heads all generally get closer to global computation / attention / receptive field
- even at low network depth, some heads already have close to global computation / attention / receptive field (yellow shaded portion) ⇒ CNNs won't have this! CNNs build their receptive field over depth: it starts small and grows with depth
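A hedged sketch of how mean attention distance can be computed from one layer's attention weights (assumes attn has shape (B, heads, N, N) over N patch tokens on a grid, with the [class] token excluded):

```python
import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """Attention-weighted average pixel distance between query and key patches, per head."""
    # 2D patch-centre coordinates in pixels, shape (N, 2)
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_size
    # pairwise Euclidean distances between patch centres, shape (N, N)
    dist = torch.cdist(coords, coords)
    # weight distances by attention probabilities, average over queries and batch
    return (attn * dist).sum(dim=-1).mean(dim=(0, 2))   # (heads,) mean distance per head
```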
Benefits vs CNN
- Early in the network, the ViT can already have global attention, vs CNNs where the receptive field only gets larger with more layers
- means that ViTs can gain information from the whole image early, while CNNs only aggregate information locally within the conv filter
- The CNN inductive bias biases the estimator towards a certain way of learning ⇒ i.e. via local receptive fields
- With more data, a ViT can learn the true representation!
- With more data ⇒ an unbiased model (ViT) will perform better than a biased model (CNN)
- but the transformer also has some strong inductive bias → e.g. the skip connections
- Scaling
- Compute
- Data
- Sure, if you have “small” amounts of data or are limited in your compute, then CNNs are the better option since they have an inductive bias specifically designed for that. But if you think back to basic Machine Learning theory: a simpler model will perform better if you have limited data, but the more data you have, the more complex your model should become. And Transformers can learn/express much more complex functions than CNNs.
- If you have a ton of data, then CNN’s inductive bias actually becomes a hindrance. It’s too strict of a corset to truly nail complex scenes: Sometimes it just is bloody helpful to be able to quickly exchange information between far away pixels: whenever you need more than one piece of information to truly understand what’s going on in an image, and those pieces of information are scattered within the scene
- Needing large datasets as the benchmark - it is only once you get to datasets orders of magnitude bigger than ImageNet that ViT truly shines.
Limitations vs CNN
- Need massive amounts of data, and hence massive compute requirements
- Massive amounts = need datasets of >14M images to beat SOTA CNNs (contentious in amounts actually needed)
- Otherwise just stick to resnet / efficientnet
- why need so much? is it because encoder only? ✅ 2023-12-22
- possible answer is because the model needs to learn the inductive bias towards images
- CNN has good Inductive Bias or Inductive Prior for images, which is that:
- probably what 1 pixel cares about is its immediate neighbourhood, and what that neighbourhood cares about is its own immediate neighbourhood ⇒ and this is exactly what the CNN models!
- hence with less data you can learn better, since this bias is likely to help the model!
- With less data ⇒ a biased model (CNN) will perform better than an unbiased model (ViT)
- But, a bias is not a perfect match for the true representation
- [D] Why Vision Tranformers? : r/MachineLearning
ViT Variants (incl OD, SemSeg)
- Quite a few try to introduce inductive biases / priors from CNNs
Multiple directions on improving/building upon ViT:
- Looking for new “self-attention” blocks (XCIT)
- Looking for new combinations of existing blocks and ideas from NLP (PVT, SWIN)
- Adapting ViT architecture to a new domain/task (e.g. SegFormer, UNETR)
- Forming architectures based on CNN design choices (MViT)
- Studying how to scale ViTs up and down for optimal transfer learning performance.
- Searching for suitable pretext task for deep unsupervised/self-supervised learning (DINO)
Swin Transformers
- local (shifted) windows for performing self-attention (see the sketch below)
- hierarchical transformer that reintroduces ConvNet priors
- Swin Transformer paper animated and explained - YouTube
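A minimal sketch of the (shifted) window partitioning idea, assuming a (B, H, W, C) feature map with H and W divisible by the window size; not the official implementation:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows for local attention."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    # (num_windows*B, window_size*window_size, C): each window is an independent attention sequence
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shift_windows(x, window_size):
    """Shifted windows (alternating blocks): cyclically shift the feature map before partitioning."""
    s = window_size // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))
```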
Data-efficient Image Transformer (DeiT)
- Review: Data Efficient Image Transformer (DeiT) | by Sik-Ho Tsang | Medium
- DeiT was proposed because ViT does not generalize well when trained on insufficient amounts of data
- Mostly same architecture as ViT, but trained on ImageNet only, no external data
- Includes a distillation token for a teacher-student strategy (loss sketched after this list); see Knowledge Distillation
- For the distillation token, using a convnet teacher gives better performance than using a transformer.
- “According to DeiT, various techniques are required to effectively train ViTs. Thus, we applied data augmentations such as CutMix, Mixup, Auto Augment, Repeated Augment to all models.” Data Augmentation (Images)
- DeiT - Data-efficient image transformers & distillation through attention (paper illustrated) - YouTube
- GitHub - facebookresearch/deit: Official DeiT repository
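A hedged sketch of DeiT-style hard-label distillation: the class token is supervised by the ground-truth label and the distillation token by the teacher's hard prediction (the names cls_logits / dist_logits are assumptions):

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of CE on true labels (class token) and CE on the teacher's argmax (distillation token)."""
    loss_cls = F.cross_entropy(cls_logits, labels)              # supervision from ground truth
    teacher_labels = teacher_logits.argmax(dim=-1)              # hard pseudo-labels from the teacher
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)    # supervision from the teacher
    return 0.5 * loss_cls + 0.5 * loss_dist
```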
Masked Autoencoder (MAE)

- Method for self-supervised pre-training of Vision Transformers
- Shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after fine-tuning that outperform supervised pre-training.
- i.e. by masking a large portion (75%) of the image patches, the model must reconstruct raw pixel values
- A form of more general denoising autoencoders
- The autoencoder is asymmetric: a lightweight decoder is used only for patch reconstruction
- The MAE encoder (a ViT) only encodes the visible, unmasked patches (masking sketched after this list); the encoded patches are then concatenated with mask tokens before the decoder.
- After pre-training, one “throws away” the decoder used to reconstruct pixels, and one uses the encoder for fine-tuning/linear probing.
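A hedged sketch of MAE-style random masking (keep ~25% of patch tokens using the per-sample noise-argsort trick; shapes assumed):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample; return kept tokens and restore indices."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)        # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)                    # ascending: smallest noise = kept
    ids_restore = ids_shuffle.argsort(dim=1)              # to undo the shuffle before the decoder
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_restore                              # the encoder sees only `kept`
```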
Token Clustering Transformer (TCFormer)
- Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

- addresses tokenisation: instead of splitting the image into a regular grid of tokens, tokens are clustered so they adapt to semantic regions and are fine-grained in important areas
DINO
- Self supervised Learning
- Multi-crop idea where teacher sees only global views while the student has access to both global and local views of the transformed input image
- the multi-crop strategy is reportedly less beneficial for CNNs than for ViTs
- attention maps illustrate that the model automatically learns class-specific features leading to unsupervised object segmentation
MLP-Mixer

- [2105.01601] MLP-Mixer: An all-MLP Architecture for Vision
- architecture based exclusively on multi-layer perceptrons (MLPs)
- 2 types of layers (sketched below):
- one with MLPs applied independently to image patches (i.e. “mixing” the per-location features)
- one with MLPs applied across patches (i.e. “mixing” spatial information).
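A minimal sketch of one Mixer layer with the two MLP types above (hidden sizes are illustrative):

```python
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer layer: token-mixing MLP (across patches) + channel-mixing MLP (per patch)."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(                    # operates on the patch axis
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(                  # operates on the channel axis
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                                  # x: (B, N, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)  # mix spatial information
        x = x + self.channel_mlp(self.norm2(x))                                # mix per-location features
        return x
```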
ConvMixer

- uses convolutions for both spatial and channel mixing (see the sketch below)
- depthwise convolutions are responsible for mixing spatial locations
- while pointwise convolutions (1x1xchannels kernels) mix channel information
- operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network.
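A minimal sketch of one ConvMixer block (illustrative sizes; the residual connection is around the depthwise conv only, as in the paper):

```python
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """Depthwise conv mixes spatial locations; pointwise (1x1) conv mixes channels."""
    def __init__(self, dim=256, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),  # spatial mixing, per channel
            nn.GELU(), nn.BatchNorm2d(dim))
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),                            # channel mixing
            nn.GELU(), nn.BatchNorm2d(dim))

    def forward(self, x):                   # x: (B, dim, H/p, W/p) patch embeddings
        x = x + self.depthwise(x)           # residual around the depthwise conv
        return self.pointwise(x)
```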
EfficientViT
- 2 papers titled EfficientViT
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
- ICCV 2023
- Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han
- Code
- Main Idea - “Multi-Scale Linear Attention module” for high-resolution dense prediction tasks
- ReLU linear attention instead of softmax attention, enhanced with convolution (see the sketch after this list)
- no hardware-inefficient operations
- Specific methods
- aggregate nearby tokens with small-kernel convolutions ⇒ generates multi-scale tokens
- ReLU linear attention performed on these multi-scale tokens ⇒ combines global receptive field with multi-scale learning
- insert depth-wise convolutions into FFN layers to improve local feature extraction capacity
- used for semseg, SAM
- Implemented in NVIDIA Jetson Generative AI Lab
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
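A hedged sketch of single-head ReLU linear attention (the core of the multi-scale linear attention paper): softmax(QKᵀ)V is replaced by the kernelised form ReLU(Q)(ReLU(K)ᵀV), normalised by ReLU(Q)(ReLU(K)ᵀ1), which is linear in sequence length; the multi-scale token aggregation and convolutions are omitted:

```python
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with a ReLU kernel: O(N) in sequence length instead of O(N^2)."""
    q, k = F.relu(q), F.relu(k)                      # non-negative feature maps replace softmax
    kv = torch.einsum("bnd,bne->bde", k, v)          # ReLU(K)^T V, computed once over all tokens
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # normaliser: ReLU(Q) (ReLU(K)^T 1)
    out = torch.einsum("bnd,bde->bne", q, kv)        # ReLU(Q) (ReLU(K)^T V)
    return out / (z.unsqueeze(-1) + eps)
```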
MobileViT
ViT Variants (OD, SemSeg)
Detection Transformer (DETR) (OD)
- End-to-End Object Detection with Transformers
- DETR ⇒ DINO (the DETR-based detector, not the self-supervised DINO above) ⇒ Grounding DINO / Mask DINO / Segment Anything ⇒ Grounding SAM, SEEM
OneFormer (Panoptic Seg, Instance Seg, Sem Seg)
Segformer
- Proposed by NVIDIA
- For semantic segmentation
CNNs since ViT (or its variants)
ConvNeXt
Questions
- what is (multi-scale) linear attention? (as per EfficientViT) ✅ 2023-12-28
- see EfficientViT
- what about DETR? ✅ 2023-12-21
Theoretical References
Papers
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Blogpost]
- A Survey of Visual Transformer
Articles
- AI Summer
Courses
Code References
Methods
Tools, Frameworks
- GitHub - lucidrains/vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
- Ross Wightman (Huggingface) - PyTorch Image Models (TIMM)

