Transformers
Created: 04 Jan 2023, 05:21 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, GeneralDL
Overview
Related fields
- Attention - Prerequisite Knowledge to Transformers
- Vision Transformers (ViT)
Transformers are the most general compute “thing” even vs MLPs
- MLPs are unbiased / universal approximators since everything is connected to everything, but the connection weights are fixed after training
- but transformers are even more general since those connections are computed on the fly
- what does this mean? fast weights? “computed on the fly” means the connection strengths change per input, even at inference time?
- Transformer Explainer: LLM Transformer Model Visually Explained
- LLM Visualization
Introduction to Transformers

- Fundamental building block of transformers is self-attention (briefly touched on in Attention)
- Initial intention: why not feed the entire input sequence at once, instead of one element at a time as in an RNN?
Backpropagation for Transformers
- yes, transformers are still trained with backpropagation; see Transformers from Scratch
Pre-processing
Tokenization
- Instead of a sequence of elements, the input is treated as a set of elements
- Tokenization

- Order is irrelevant
- Elements of the sequence / of the set are called tokens
- After tokenization, words are projected into a distributed geometrical space, i.e. word embeddings are built
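A toy sketch of the idea in Python (the whitespace split and vocabulary below are illustrative assumptions; real tokenizers use subword schemes such as BPE or WordPiece):

```python
sentence = "Hello I love you"

# toy whitespace tokenization; production tokenizers use subword schemes (BPE, WordPiece, ...)
tokens = sentence.split()                          # ['Hello', 'I', 'love', 'you']

# map each token to an integer id via a made-up vocabulary
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]         # e.g. [0, 1, 2, 3]
```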
Word Embeddings
- An embedding is a representation of a symbol in a distributed low-dimensional space of continuous-valued vectors.
- Projected into a continuous Euclidean space such that associations can be found between them
- Depending on the task, embeddings can then be pushed further apart or kept close together
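A minimal sketch, assuming PyTorch; the vocabulary size and `d_model` are made-up toy values:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 4, 8                  # toy sizes for illustration
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([0, 1, 2, 3])      # integer ids from the tokenization step
word_embeddings = embedding(token_ids)      # shape (4, 8): one continuous vector per token
```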
Positional Encodings
- Since order was discarded in tokenization, there must be a way to reintroduce it
- Transformers process sequences as sets, so in theory they are permutation invariant
- Positions are encoded by slightly altering the embeddings based on the position
- a set of small constants, which are added to the word embedding vector before the first self-attention layer

- if the same word appears in a different position, the actual representation will be slightly different, depending on where it appears in the input sentence
- Original Transformer paper uses sinusoidal functions for the positional encoding:
- $PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
- tells the model to pay attention to a particular wavelength $\lambda$
- Given a signal $y(x) = \sin(kx)$, the wavelength will be $\lambda = \frac{2\pi}{k}$
- In our case, $k$ depends on the position $pos$ in the sentence
- $i$ is used to distinguish between even and odd embedding dimensions (sine for even, cosine for odd)
- $d_{model}$ is the dimensionality of the embedding vectors
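A minimal sketch of the sinusoidal encoding above, assuming PyTorch and an even `d_model`:

```python
import torch

def sinusoidal_positional_encoding(t: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = torch.arange(t, dtype=torch.float32).unsqueeze(1)               # (t, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(t, d_model)
    pe[:, 0::2] = torch.sin(positions / div_term)   # even dimensions -> sine
    pe[:, 1::2] = torch.cos(positions / div_term)   # odd dimensions  -> cosine
    return pe

# the encodings are simply added to the word embeddings before the first self-attention layer
pe = sinusoidal_positional_encoding(t=4, d_model=8)
```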
Transformer Encoder

- original paper stacks $N = 6$ identical encoder blocks
- Note: here layer norm is applied after the residual addition (post-norm); most modern transformers instead apply layer norm to the input of each sublayer, inside the residual branch (pre-norm)
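A minimal sketch of the two orderings (assumed PyTorch-style; `sublayer` stands for either the self-attention or the MLP sub-block):

```python
import torch.nn as nn

def post_norm_step(x, sublayer, norm: nn.LayerNorm):
    # original Transformer (post-norm): residual addition first, then layer norm
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm: nn.LayerNorm):
    # common modern variant (pre-norm): normalize the sublayer input; the residual path stays untouched
    return x + sublayer(norm(x))
```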
Feature-based attention: Key $K$, Value $V$ and Query $Q$
- Intuition from Information Retrieval Systems
- When you search (query $Q$) for a particular video, the search engine will map your query against a set of keys $K$ (video title, description, etc.) associated with possible stored videos. Then the algorithm will present you the best-matched videos (values $V$).
- We use the keys to define the attention weights to look at the data and the values as the information that we will actually get.
- Mapping query against keys requires a similarity metric: vector similarity
- This is handled by self-attention.
Self-Attention {CORE IDEA}
- “Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.” ~ Ashish Vaswani et al. (2017)
- “Hello I love you” for example: a trained self-attention layer will associate the word “love” with the words “I” and “you” with a higher weight than the word “Hello”.

- In practice: the Transformer has 3 representations of the input - Queries $Q$, Keys $K$ and Values $V$

- multiplying our input $X$ with 3 different weight matrices $W_Q$, $W_K$ and $W_V$, where $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- Matrix multiplication on the original word embeddings, where the dimensions become smaller ($d_{model} \rightarrow d_k$)
- $d_{model}$ = size of embedding vector of each input element from sequence
- $d_k$ = inner dimension specific to each self-attention layer
- $b$ = batch size
- $t$ = number of elements in sequence
- Self-attention layer: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ (see the code sketch at the end of this section)
- Note:
- $QK^T$ is the dot-product scoring function that represents the correlation between two words (i.e. the attention weight)
- Other similarity functions can also be used instead
- $\sqrt{d_k}$ is used as a scaling factor to stop the dot products from growing too large and saturating the softmax (which would give vanishing gradients)
- $QK^T$ finds the similarity of the search query with the keys in the database
- softmax is used to get the final attention weights as a probability distribution ⇒ row-wise softmax normalisation
- since $V$ is kept as a distinct representation from $K$, the softmax output is multiplied with $V$ to retrieve the actual content
- Self-attention matrix $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ refers to:
- “where to look”
- Value matrix $V$:
- “what I want to get”
- Similar to vector similarity calculations, but using matrices and scaling by the key dimension $\sqrt{d_k}$
- Self-attention alone is already intrinsically parallel due to batching the embedding vectors within the query matrix ⇒ a single matrix computation of $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ for the whole sequence
- Self-attention is NOT symmetric
- in order to make self-attention symmetric, $W_Q$ must equal $W_K$ (so that $Q = K$)
- since when you multiply a matrix with its transpose you get a symmetric matrix

- therefore, some papers use one shared projection matrix for $Q$ and $K$ ??
- why would you want self-attention to be symmetric?
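A minimal sketch of a single self-attention layer following the formula above (assuming PyTorch; shapes use the $t$, $d_{model}$, $d_k$ notation, with the batch dimension omitted):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (t, d_model); w_q, w_k, w_v: (d_model, d_k)
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # (t, d_k) each
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (t, t): "where to look"
    weights = torch.softmax(scores, dim=-1)               # row-wise softmax normalisation
    return weights @ v                                    # (t, d_k): weighted sum of the values

t, d_model, d_k = 4, 8, 8
x = torch.randn(t, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
```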
Skip Connections
- gives a transformer a tiny ability to allow the representations of different levels of processing to interact
- “pass” our higher-level understanding of the last layers to the previous layers
- same as the ResNet idea; see here for more details
Layer Normalization
- mean and variance are computed across channels and spatial dimensions, i.e. per sample and independently of the rest of the batch (unlike batch norm); for a transformer this means per token, across the embedding dimension
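A minimal sketch of that computation for a transformer activation (assuming PyTorch; statistics are taken per token over the embedding dimension):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 8)                       # (batch, tokens, d_model)

# per-token statistics over the feature dimension, independent of the batch (unlike batch norm)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_manual = (x - mean) / torch.sqrt(var + 1e-5)

x_builtin = nn.LayerNorm(8, elementwise_affine=False)(x)
assert torch.allclose(x_manual, x_builtin, atol=1e-5)
```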

Linear Layer (MLP)
- Linear layer (PyTorch) = Dense layer (Keras) = feed-forward layer (ML)
- $W$ is a matrix, $x$ and $b$ are vectors, computing $y = Wx + b$
- For the original Transformer, they did 2 Linear layers with dropout and non-linearities i.e. a Multilayer Perceptron (MLP)
- Intention is to project the output from self-attention into a higher-dimensional space (see the sketch below)
- Solves bad initialization, rank collapse

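A minimal sketch of the position-wise MLP (assuming PyTorch; 512, 2048 and 0.1 are the sizes reported in the original paper):

```python
import torch.nn as nn

d_model, d_ff, dropout = 512, 2048, 0.1        # hidden size blown up, then projected back down

# two Linear layers with a non-linearity and dropout in between
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(d_ff, d_model),
)
```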
Multi-head Attention, Parallel Implementation {CORE IDEA}

Benefits
- Parallelism
- Self-attention alone is already intrinsically parallel due to batching the embedding vectors within the query matrix ⇒ a single matrix computation for the whole sequence
- Multi-head attention
- Allow for attending to different parts of sequence differently each time
- Can capture positional information since each head attends to different segments of the sequence
- Each head captures different contextual info by correlating words uniquely: “allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.”
How?
- Run through attention mechanism several times
- Allows for multiple independent paths to understand input
- Each time mapping an independent set of $Q$, $K$, $V$ matrices into different lower-dimensional spaces, then computing attention there (each output is a “head”)
- Mapping is achieved by multiplying each matrix with a separate weight matrix, denoted as $W_i^Q$, $W_i^K$ and $W_i^V$
- Output vector size is divided by the number of heads, i.e. each head works in a lower dimension $d_{model}/h$
- compensates for the extra complexity of running multiple heads
- Heads are concatenated, then transformed using a square weight matrix $W^O \in \mathbb{R}^{d_{model} \times d_{model}}$, since concatenating $h$ heads of size $d_{model}/h$ restores dimension $d_{model}$
- Parallelism
- Since heads are independent from each other, we can perform the self-attention computation in parallel on different workers
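A minimal sketch using PyTorch's built-in module as a stand-in for the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ and the output matrix $W^O$ (sizes here are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)                    # (batch, tokens, d_model)
# self-attention: the same sequence serves as query, key and value;
# internally each of the 8 heads works in dimension 512 / 8 = 64
out, attn_weights = mha(x, x, x)
```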
Transformer Decoder

- contains all the above components as per Encoder, but has 2 novel parts per below
- output probabilities predict next token in output sentence
- a probability is assigned to each word in the French vocabulary (in the translation example); keep the word with the highest score
Masked Multi-head Attention
- in the decoder, we predict 1 word (token) at a time, sequentially
- self-attention needs to be modified to consider only the output sentence that has been generated so far ⇒ the whole sentence is not known since it hasn't been produced yet
- to disregard unknown words ⇒ must mask the future word embeddings (by setting them to $-\infty$ before the softmax); see the sketch below
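A minimal sketch of the masking step (assuming PyTorch; `scores` stands in for the scaled $QK^T$ matrix):

```python
import torch

t = 4
scores = torch.randn(t, t)                     # stand-in for Q K^T / sqrt(d_k)

# causal mask: position i may only attend to positions <= i;
# future positions get -inf so softmax assigns them zero weight
mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
```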