Transformers


Created: 04 Jan 2023, 05:21 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, GeneralDL


Overview

Transformers are arguably the most general compute “thing”, even compared to MLPs

  • MLPs are unbiased / universal approximators since everything is connected to everything, but the connection weights are fixed after training
  • transformers are even more general since those connections (the attention weights) are computed on the fly from the input
  • what does this mean exactly? fast weights? does “computed on the fly” mean the effective weights also change at inference time, depending on the input? (see the sketch below)
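
A minimal sketch of the “computed on the fly” point (not from the original note, assuming PyTorch): a linear layer applies the same fixed weight matrix to every input, whereas attention weights are recomputed from each input.

```python
import torch

torch.manual_seed(0)
d = 8
x1, x2 = torch.randn(4, d), torch.randn(4, d)  # two different "sequences" of 4 tokens

# Linear layer: the mixing weights are fixed after training, independent of the input
linear = torch.nn.Linear(d, d, bias=False)
print(linear.weight.shape)  # the same (d, d) weights are applied to x1 and x2

# Self-attention: the mixing weights are recomputed from each input ("fast weights")
def attention_weights(x):
    scores = x @ x.T / d ** 0.5  # using Q = K = x here purely for brevity
    return torch.softmax(scores, dim=-1)

print(attention_weights(x1))  # a different (4, 4) mixing matrix...
print(attention_weights(x2))  # ...for each input
```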

Visual resources: Transformer Explainer: LLM Transformer Model Visually Explained; LLM Visualization

Introduction to Transformers

  • The fundamental building block of transformers is self-attention (briefly touched on in Attention)
  • Initial motivation: why not feed the model the entire input sequence at once (instead of processing it element by element)?

Backpropagation for Transformers

Pre-processing

Tokenization

  • Instead of a sequence of elements, the input is treated as a set of elements
    • Tokenization
    • Order is irrelevant
    • Elements of the sequence / of the set are called tokens
  • After tokenization, words are projected into a distributed geometrical space, i.e. word embeddings are built

Word Embeddings

  • An embedding is a representation of a symbol in a distributed low-dimensional space of continuous-valued vectors.
  • Words are projected into a continuous Euclidean space such that associations can be found between them
  • Depending on the task, the embeddings can then be pushed further apart or kept close together

Positional Encodings

  • Since order was discarded during tokenization, there must be a way to reintroduce it
    • Transformers process sequences as sets, so in theory they are permutation invariant
  • Positions are encoded by slightly altering the embeddings based on the position
    • a set of small constants, which are added to the word embedding vector before the first self-attention layer
    • if the same word appears in a different position, the actual representation will be slightly different, depending on where it appears in the input sentence
  • Original Transformer paper uses sinusoidal function for the positional encoding
    • tells the model to pay attention to a particular wavelength 
    • Given a signal $y(x) = \sin(kx)$, the wavelength will be $\lambda = \frac{2\pi}{k}$.
    $$\begin{align}
    PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/512}}\right) \\
    PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/512}}\right)
    \end{align}$$
    • In our case, $pos$ is the position of the token in the sentence.
    • $i$ indexes the embedding dimensions and distinguishes the even ($2i$, sine) from the odd ($2i+1$, cosine) ones.
    • $d_{model} = 512$, which is the dimensionality of the embedding vectors (see the code sketch below)
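
A minimal sketch of the sinusoidal encoding above, assuming PyTorch; the function name is just illustrative.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int = 512) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 0, 2, 4, ..., d_model-2
    angle = pos / (10000 ** (two_i / d_model))                      # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions
    return pe

# Added to the word embeddings before the first self-attention layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```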

Transformer Encoder

  • the original paper stacks $N = 6$ identical encoder blocks, each consisting of multi-head self-attention, a position-wise MLP, skip connections and layer normalization
  • Note: in the original architecture, layer norm is applied after the residual (skip) addition (post-norm), but nowadays layer norm is typically applied to the sublayer input inside the residual branch (pre-norm)

Feature-based attention: Key, Value and Query

  • Intuition from Information Retrieval Systems
    • When you search (query $\textbf{Q}$) for a particular video, the search engine will map your query against a set of keys $\textbf{K}$ (video title, description, etc.) associated with the stored videos. Then the algorithm will present you the best-matched videos (values $\textbf{V}$).
  • We use the keys to define the attention weights to look at the data and the values as the information that we will actually get.
  • Mapping query against keys requires a similarity metric: vector similarity
  • This is handled by self-attention.

Self-Attention {CORE IDEA}

  • “Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.” ~ Ashish Vaswani et al (2017).

    • “Hello I love you”, for example: a trained self-attention layer will associate the word “love” with the words “I” and “you” with a higher weight than with the word “Hello”.
  • In practice: Transformer has 3 representations - Queries, Keys, Values

  • multiplying our input $\textbf{X}$ with 3 different weight matrices $\textbf{W}^Q$, $\textbf{W}^K$ and $\textbf{W}^V$, where $\textbf{W}^Q, \textbf{W}^K, \textbf{W}^V \in R^{d_{model} \times d_k}$

    • Matrix multiplication on the original word embeddings, where the dimensions become smaller
    • $d_{model}$ = size of the embedding vector of each input element of the sequence
    • $d_k$ = inner dimension specific to each self-attention layer
    • $b$ = batch size
    • $N$ = number of elements in the sequence
  • Self-attention layer (see also the code sketch after this list):
    $$\operatorname{Attention}(\textbf{Q},\textbf{K},\textbf{V}) = \operatorname{softmax}\left(\frac{\textbf{Q}\textbf{K}^{T}}{\sqrt{d_{k}}}\right)\textbf{V}$$

    • Note:
      • $\textbf{Q}\textbf{K}^{T}$ is the dot-product attention method of calculating the scoring function that represents the correlation between two words (i.e. the attention weight)
        • Other similarity functions can also be used instead
      • $\sqrt{d_{k}}$ is used as a scaling factor, preventing the dot products from growing so large that the softmax saturates and its gradients vanish
      • $\textbf{Q}\textbf{K}^{T}$ finds the similarity of the search query with the keys in the database
      • $\operatorname{softmax}$ is used to turn the scores into the final attention weights, a probability distribution (row-wise softmax normalisation)
      • since $\textbf{V}$ is kept as a representation distinct from $\textbf{Q}$ and $\textbf{K}$, the attention weights are multiplied with $\textbf{V}$
      • $d_{model}$ = size of the embedding vector of each input element of the sequence
      • $d_k$ = inner dimension specific to each self-attention layer
      • $b$ = batch size
      • $N$ = number of elements in the sequence
    • The self-attention matrix $\operatorname{softmax}\left(\frac{\textbf{Q}\textbf{K}^{T}}{\sqrt{d_{k}}}\right)$ refers to:
      • “where to look”
    • The value matrix $\textbf{V}$ refers to:
      • “what I want to get”
    • Similar to vector similarity calculations, but with matrices, and scaled by $\sqrt{d_k}$
  • Self-attention alone is already intrinsically parallel, since the embedding vectors are batched into matrices for the calculation of $\textbf{Q}\textbf{K}^{T}$

  • Self-attention is NOT symmetric

    • in order to make self-attention symmetric, it must hold that $\textbf{W}^Q = \textbf{W}^K$
      • since when you multiply a matrix with its own transpose you get a symmetric matrix: $\textbf{Q}\textbf{K}^{T} = \textbf{X}\textbf{W}^Q(\textbf{X}\textbf{W}^Q)^{T}$
    • therefore, some papers use one shared projection matrix for $\textbf{Q}$ and $\textbf{K}$ ??
    • why would you want self-attention to be symmetric?
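
A minimal sketch of the scaled dot-product self-attention described above, assuming PyTorch; variable names are illustrative and the projection matrices are random stand-ins for trained weights.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sequence x of shape (N, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # each (N, d_k)
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (N, N) similarity scores
    weights = torch.softmax(scores, dim=-1)          # row-wise distribution: "where to look"
    return weights @ v                               # weighted sum of values: "what to get"

d_model, d_k, N = 16, 8, 5
x = torch.randn(N, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # torch.Size([5, 8])
# Note: the score matrix q @ k.T is not symmetric in general, unless w_q == w_k.
```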

Skip Connections

  • gives the transformer the ability to let representations from different levels of processing interact
  • “passes” the higher-level understanding of the later layers back to the earlier layers (via the extra gradient paths)
  • same idea as the skip connections in ResNets, see here for more details

Layer Normalization

  • mean and variance are computed across the channel (feature) and spatial dimensions of each sample individually, not across the batch (see the sketch below)
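
A quick check of what layer norm normalises over, assuming PyTorch's `torch.nn.LayerNorm`:

```python
import torch

x = torch.randn(2, 5, 16)  # (batch, tokens, d_model)

# Statistics are taken over the last (feature) dimension of each token, not over the batch
ln = torch.nn.LayerNorm(16)
manual = (x - x.mean(-1, keepdim=True)) / torch.sqrt(x.var(-1, unbiased=False, keepdim=True) + 1e-5)
print(torch.allclose(ln(x), manual, atol=1e-5))  # True (learnable scale/shift start at 1 and 0)
```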

Linear Layer (MLP)

  • Linear layer (PyTorch) = Dense layer (Keras) = feed-forward layer (ML)
    • $y = \textbf{W}x + b$, where $\textbf{W}$ is a matrix and $x$, $b$ are vectors
  • For the original Transformer, they used 2 linear layers with dropout and a non-linearity in between, i.e. a Multilayer Perceptron (MLP); see the sketch below
  • The intention is to project the output from self-attention into a higher-dimensional space (and back)
    • Helps against bad initialization and rank collapse
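
A minimal sketch of that position-wise MLP block, assuming PyTorch; $d_{ff} = 2048$, ReLU and dropout 0.1 follow the original paper, the rest is illustrative.

```python
import torch

d_model, d_ff = 512, 2048  # d_ff = 4 * d_model in the original Transformer

# Two linear layers with a non-linearity and dropout: project up, then back down
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_model, d_ff),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.1),
    torch.nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 10, d_model)  # (batch, tokens, d_model)
print(mlp(x).shape)              # unchanged shape: applied to each token independently
```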

Multi-head Attention, Parallel Implementation {CORE IDEA}

Benefits
  • Parallelism
    • Self-attention alone is already intrinsically parallel, since the embedding vectors are batched into matrices for the calculation of $\textbf{Q}\textbf{K}^{T}$
    • Multi-head attention adds another level of parallelism, since the heads are independent of each other
  • Allows for attending to different parts of the sequence differently each time
  • Can capture positional information, since each head attends to different segments of the input
  • Each head captures different contextual information by correlating words uniquely: “allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.”
How?
  • Run through attention mechanism several times
  • Allows for multiple independent paths to understand input
  • Each time, an independent set of $\textbf{Q}$, $\textbf{K}$, $\textbf{V}$ matrices is mapped into different lower-dimensional spaces, and attention is computed there (each output is a “head”); see the code sketch after this list
$$\begin{align}
\operatorname{MultiHead}(\textbf{Q},\textbf{K},\textbf{V}) &= \operatorname{Concat}(\text{head}_1,\cdots,\text{head}_h)\, \textbf{W}^O \\
\text{where: } \text{head}_i &= \operatorname{Attention}(\textbf{Q}\textbf{W}_i^Q,\textbf{K}\textbf{W}_i^K,\textbf{V}\textbf{W}_i^V) \\
\text{and where: } & \textbf{W}_i^Q,\textbf{W}_i^K,\textbf{W}_i^V \in R^{d_{model} \times d_k} \\
& \textbf{W}^O \in R^{d_{model} \times d_{model}} \\
& d_{model} = hd_k
\end{align}$$
  • The mapping is achieved by multiplying each matrix with a separate weight matrix, denoted as $\textbf{W}_i^Q$, $\textbf{W}_i^K$ and $\textbf{W}_i^V$.
  • The output vector size is divided by the number of heads, i.e. $d_k = d_{model}/h$ (is this done inside the projection matrices?)
    • compensates for extra complexity
  • The heads are concatenated and transformed using a square weight matrix $\textbf{W}^O \in R^{d_{model} \times d_{model}}$, since $d_{model} = hd_k$
  • Parallelism
    • Since heads are independent from each other, we can perform the self-attention computation in parallel on different workers
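
A minimal sketch of multi-head attention following the formula above, assuming PyTorch; packing all heads' $\textbf{W}_i$ matrices into a single `Linear` per Q/K/V is an implementation convenience, not part of the formula.

```python
import torch

class MultiHeadAttention(torch.nn.Module):
    """Multi-head self-attention with d_k = d_model / h."""

    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One projection per Q/K/V covers the W_i of all heads at once
        self.w_q = torch.nn.Linear(d_model, d_model, bias=False)
        self.w_k = torch.nn.Linear(d_model, d_model, bias=False)
        self.w_v = torch.nn.Linear(d_model, d_model, bias=False)
        self.w_o = torch.nn.Linear(d_model, d_model, bias=False)  # W^O

    def forward(self, x):                             # x: (batch, N, d_model)
        b, n, _ = x.shape
        def split(t):                                 # -> (batch, h, N, d_k)
            return t.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v     # attention in every head, in parallel
        concat = heads.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(concat)                       # Concat(head_1, ..., head_h) W^O

print(MultiHeadAttention()(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```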

Transformer Decoder

  • contains all the components of the Encoder above, but with 2 novel parts described below
  • the output probabilities predict the next token in the output sentence
    • a probability is assigned to each word of the output vocabulary (e.g. each French word in a translation task), and the highest-scoring word is kept

Masked Multi-head Attention

  • in the decoder, words (tokens) are predicted one at a time, sequentially
  • self-attention needs to be modified to consider only the part of the output sentence that has been generated so far, since the whole sentence has not been produced yet
  • to disregard the unknown words, the embeddings of the future positions must be masked out (by setting their scores to $-\infty$)
$$\begin{align}
\operatorname{MaskedAttention}(\textbf{Q}, \textbf{K}, \textbf{V}) &= \operatorname{softmax}\left(\frac{\textbf{Q}\textbf{K}^{T} + \textbf{M}}{\sqrt{d_{k}}}\right)\textbf{V} \\
\text{where } \textbf{M} &\text{ consists of zeros and } -\infty
\end{align}$$

- This removes the corresponding connections, and the rest runs as per Multi-head Attention

#### Encoder-Decoder Attention / Cross-Attention {CORE IDEA}

![[Pasted image 20231208174951.png | 300]] ![[Pasted image 20231220144051.png|300]]

- Decoder processes the encoded representation --> the attention matrix generated by the encoder is passed to another attention layer alongside the result of the previous Masked Multi-head attention block (the output of the last encoder block is used in each decoder block)
- Intuition: combine the input and the output sentence
	- the encoder’s output encapsulates the final embedding of the input sentence --> used to produce the Key and Value matrices
	- the masked multi-head attention block output contains the sentence generated so far --> represented as the Query matrix
- encoder-decoder attention / cross-attention is trained to associate the input sentence with the corresponding output word --> eventually learns the mapping

### Why does it work well

- Distributed, independent representations at each block
	- with multiple heads, allows for capturing different features of the input
- Meaning heavily depends on context
	- Matching between the NLP task and the self-attention design
	- No notion of locality, since the model makes global associations
- Multiple encoder and decoder blocks
	- More layers -> more abstract representations
	- Similar to the [receptive field](https://theaisummer.com/receptive-field/) idea in CNNs, but in terms of pairs
- Combination of high and low-level information
	- Using skip connections, top-down understanding can flow back through the multiple gradient paths that flow backward

### Self-attention VS linear layers VS convolutions

- the values of the self-attention weights are computed on the fly
	- data-dependent dynamic weights, because they change dynamically in response to the data (fast weights)
- the weights of a feedforward (linear) layer change very slowly with SGD (slow weights)
- the weights in convolutions (also slow weights) are additionally restricted to a fixed local window via the kernel size

### Benefits

1. Parallel Processing
	- Transformers allow for parallel processing of input sequences, speeding up training and inference in neural networks.
	- Self-attention alone is already intrinsically parallel due to batching embedding vectors within the matrix calculation for $\textbf{Q}\textbf{K}^T$
	- Multi-head attention adds another level of parallelism across independent heads
2. Long-range Dependencies
	- They excel in capturing long-range dependencies in sequences, making them effective for tasks involving contextual understanding like language translation and text generation.
3. Self-attention Mechanism
	- Transformers use the self-attention mechanism, allowing them to weigh different parts of the input sequence differently, focusing on relevant information for each output token. Contextual understanding.
4. Scalability
	- They're more scalable compared to recurrent neural networks (RNNs) and can handle longer sequences without the same memory constraints.

### Limitations

1. High computational and memory demands
	- Transformers can be computationally expensive, especially with large models and huge datasets, requiring substantial computational resources.
	- In self-attention, all pairs of interactions between words need to be computed, which means computation grows **quadratically** with the sequence length, requiring significant memory and training time
2. Attention Overheads
	- For longer sequences, the attention mechanism becomes computationally intensive due to the **quadratic relationship** between sequence length and computation, affecting efficiency.
3. Data Efficiency
	- Transformers might require larger amounts of data for effective training compared to some other models, making them more data-hungry in certain scenarios.
4. Interpretability
	- The attention mechanism in transformers, while powerful, can sometimes lack interpretability, making it challenging to understand how specific decisions are made.

### Addressing limitations

1. Mainly targets the quadratic complexity issue in self-attention, or reducing compute cost
	- See [[Efficient Transformers]]

### Questions

- [ ] fine-tuning methods? pretrain then finetune? why is LoRA so popular?
- [ ] what about encoder-only, decoder-only transformers?
	- encoder only - like [[Vision Transformers (ViT)]]!
	- decoder only?
- [x] what about MoEs? what are they? routing MoEs? ✅ 2023-12-22
	- see [[Mixture of Experts (MoE)]]

---

## Theoretical References

### Papers

- [Vaswani (2017) Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Attention Is All You Need - YouTube](https://www.youtube.com/watch?v=iDulhoQ2pro)

### Articles

- AI Summer
	1. [How Transformers work in deep learning and NLP: an intuitive introduction | AI Summer](https://theaisummer.com/transformer/)
	2. [Why multi-head self attention works: math, intuitions and 10+1 hidden insights | AI Summer](https://theaisummer.com/self-attention/)
- [The Transformer Family Version 2.0 | Lil'Log](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/)
- [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)
- [The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.](https://jalammar.github.io/illustrated-transformer/)
- [Transformers from Scratch](https://e2eml.school/transformers.html)

### Courses

- DeepMind’s deep learning videos 2020 with UCL, Lecture: [Deep Learning for Natural Language Processing](https://www.youtube.com/watch?v=8zAP2qWAsKg&t=2410s&ab_channel=DeepMind), Felix Hill

---

## Code References

### Methods

- 

### Tools, Frameworks

- [GitHub - The-AI-Summer/self-attention-cv: Implementation of various self-attention mechanisms focused on computer vision. Ongoing repository.](https://github.com/The-AI-Summer/self-attention-cv)