Transformers
Created: 04 Jan 2023, 05:21 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, GeneralDL
Overview
Related fields
- Attention - Prerequisite Knowledge to Transformers
- Vision Transformers (ViT)
Transformers are the most general compute “thing” even vs MLPs
- MLPs are unbiased / universal approximators since everything is connected to everything, but the connection weights are fixed after training
- but transformers are even more general since those connections are computed on the fly
- what does this mean? fast weights? “computed on the fly” means the connection strengths change per input, even at inference time?
- Transformer Explainer: LLM Transformer Model Visually Explained
- LLM Visualization
Introduction to Transformers

- Fundamental building block of transformers is self-attention (briefly touched on in Attention)
- Initial intention: why not feed the entire input sequence at once, instead of one element at a time as in an RNN?
Backpropagation for Transformers
- yes, transformers are still trained with backpropagation; see Transformers from Scratch
Pre-processing
Tokenization
- Instead of a sequence of elements, the input is treated as a set of elements
- Tokenization

- Order is irrelevant
- Elements of the sequence / of the set are called tokens
- After tokenization, words are projected into a distributed geometrical space, i.e. word embeddings are built
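A toy sketch of the idea in Python (the whitespace split and vocabulary below are illustrative assumptions; real tokenizers use subword schemes such as BPE or WordPiece):

```python
sentence = "Hello I love you"

# toy whitespace tokenization; production tokenizers use subword schemes (BPE, WordPiece, ...)
tokens = sentence.split()                          # ['Hello', 'I', 'love', 'you']

# map each token to an integer id via a made-up vocabulary
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]         # e.g. [0, 1, 2, 3]
```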
Word Embeddings
- An embedding is a representation of a symbol in a distributed low-dimensional space of continuous-valued vectors.
- Projected into a continuous Euclidean space such that associations can be found between them
- Depending on the task, embeddings can then be pushed further apart or kept close together
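A minimal sketch, assuming PyTorch; the vocabulary size and `d_model` are made-up toy values:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 4, 8                  # toy sizes for illustration
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([0, 1, 2, 3])      # integer ids from the tokenization step
word_embeddings = embedding(token_ids)      # shape (4, 8): one continuous vector per token
```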
Positional Encodings
- Since order was discarded in tokenization, there must be a way to reintroduce it
- Transformers process sequences as sets, so in theory they are permutation invariant
- Positions are encoded by slightly altering the embeddings based on the position
- a set of small constants, which are added to the word embedding vector before the first self-attention layer

- if the same word appears in a different position, the actual representation will be slightly different, depending on where it appears in the input sentence
- Original Transformer paper uses sinusoidal functions for the positional encoding:
- $PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
- tells the model to pay attention to a particular wavelength $\lambda$
- Given a signal $y(x) = \sin(kx)$, the wavelength will be $\lambda = \frac{2\pi}{k}$
- In our case, $k$ depends on the position $pos$ in the sentence
- $i$ is used to distinguish between even and odd embedding dimensions (sine for even, cosine for odd)
- $d_{model}$ is the dimensionality of the embedding vectors
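A minimal sketch of the sinusoidal encoding above, assuming PyTorch and an even `d_model`:

```python
import torch

def sinusoidal_positional_encoding(t: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = torch.arange(t, dtype=torch.float32).unsqueeze(1)               # (t, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(t, d_model)
    pe[:, 0::2] = torch.sin(positions / div_term)   # even dimensions -> sine
    pe[:, 1::2] = torch.cos(positions / div_term)   # odd dimensions  -> cosine
    return pe

# the encodings are simply added to the word embeddings before the first self-attention layer
pe = sinusoidal_positional_encoding(t=4, d_model=8)
```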
Transformer Encoder

- original paper stacks $N = 6$ identical encoder blocks
- Note: here layer norm is applied after the residual addition (post-norm); most modern transformers instead apply layer norm to the input of each sublayer, inside the residual branch (pre-norm)
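A minimal sketch of the two orderings (assumed PyTorch-style; `sublayer` stands for either the self-attention or the MLP sub-block):

```python
import torch.nn as nn

def post_norm_step(x, sublayer, norm: nn.LayerNorm):
    # original Transformer (post-norm): residual addition first, then layer norm
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm: nn.LayerNorm):
    # common modern variant (pre-norm): normalize the sublayer input; the residual path stays untouched
    return x + sublayer(norm(x))
```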
Feature-based attention: Key $K$, Value $V$ and Query $Q$
- Intuition from Information Retrieval Systems
- When you search (query $Q$) for a particular video, the search engine will map your query against a set of keys $K$ (video title, description, etc.) associated with possible stored videos. Then the algorithm will present you the best-matched videos (values $V$).
- We use the keys to define the attention weights to look at the data and the values as the information that we will actually get.
- Mapping query against keys requires a similarity metric: vector similarity
- This is handled by self-attention.
Self-Attention {CORE IDEA}
- “Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.” ~ Ashish Vaswani et al. (2017)
- “Hello I love you” for example: a trained self-attention layer will associate the word “love” with the words “I” and “you” with a higher weight than the word “Hello”.

- In practice: the Transformer has 3 representations of the input - Queries $Q$, Keys $K$ and Values $V$

- multiplying our input $X$ with 3 different weight matrices $W_Q$, $W_K$ and $W_V$, where $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- Matrix multiplication on the original word embeddings, where the dimensions become smaller ($d_{model} \rightarrow d_k$)
- $d_{model}$ = size of embedding vector of each input element from sequence
- $d_k$ = inner dimension specific to each self-attention layer
- $b$ = batch size
- $t$ = number of elements in sequence
- Self-attention layer: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ (see the code sketch at the end of this section)
- Note:
- $QK^T$ is the dot-product scoring function that represents the correlation between two words (i.e. the attention weight)
- Other similarity functions can also be used instead
- $\sqrt{d_k}$ is used as a scaling factor to stop the dot products from growing too large and saturating the softmax (which would give vanishing gradients)
- $QK^T$ finds the similarity of the search query with the keys in the database
- softmax is used to get the final attention weights as a probability distribution ⇒ row-wise softmax normalisation
- since $V$ is kept as a distinct representation from $K$, the softmax output is multiplied with $V$ to retrieve the actual content
- Self-attention matrix $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ refers to:
- “where to look”
- Value matrix $V$:
- “what I want to get”
- Similar to vector similarity calculations, but using matrices and scaling by the key dimension $\sqrt{d_k}$
- Self-attention alone is already intrinsically parallel due to batching the embedding vectors within the query matrix ⇒ a single matrix computation of $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ for the whole sequence
- Self-attention is NOT symmetric
- in order to make self-attention symmetric, $W_Q$ must equal $W_K$ (so that $Q = K$)
- since when you multiply a matrix with its transpose you get a symmetric matrix

- therefore, some papers use one shared projection matrix for $Q$ and $K$ ??
- why would you want self-attention to be symmetric?
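A minimal sketch of a single self-attention layer following the formula above (assuming PyTorch; shapes use the $t$, $d_{model}$, $d_k$ notation, with the batch dimension omitted):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (t, d_model); w_q, w_k, w_v: (d_model, d_k)
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # (t, d_k) each
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (t, t): "where to look"
    weights = torch.softmax(scores, dim=-1)               # row-wise softmax normalisation
    return weights @ v                                    # (t, d_k): weighted sum of the values

t, d_model, d_k = 4, 8, 8
x = torch.randn(t, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
```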
Skip Connections
- gives a transformer a tiny ability to allow the representations of different levels of processing to interact
- “pass” our higher-level understanding of the last layers to the previous layers
- same as the ResNet idea; see here for more details
Layer Normalization
- mean and variance are computed across channels and spatial dimensions, i.e. per sample and independently of the rest of the batch (unlike batch norm); for a transformer this means per token, across the embedding dimension
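A minimal sketch of that computation for a transformer activation (assuming PyTorch; statistics are taken per token over the embedding dimension):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 8)                       # (batch, tokens, d_model)

# per-token statistics over the feature dimension, independent of the batch (unlike batch norm)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_manual = (x - mean) / torch.sqrt(var + 1e-5)

x_builtin = nn.LayerNorm(8, elementwise_affine=False)(x)
assert torch.allclose(x_manual, x_builtin, atol=1e-5)
```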

Linear Layer (MLP)
- Linear layer (PyTorch) = Dense layer (Keras) = feed-forward layer (ML)
- $W$ is a matrix, $x$ and $b$ are vectors, computing $y = Wx + b$
- For the original Transformer, they did 2 Linear layers with dropout and non-linearities i.e. a Multilayer Perceptron (MLP)
- Intention is to project the output from self-attention into a higher-dimensional space (see the sketch below)
- Solves bad initialization, rank collapse

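A minimal sketch of the position-wise MLP (assuming PyTorch; 512, 2048 and 0.1 are the sizes reported in the original paper):

```python
import torch.nn as nn

d_model, d_ff, dropout = 512, 2048, 0.1        # hidden size blown up, then projected back down

# two Linear layers with a non-linearity and dropout in between
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(d_ff, d_model),
)
```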
Multi-head Attention, Parallel Implementation {CORE IDEA}

Benefits
- Parallelism
- Self-attention alone is already intrinsically parallel due to batching the embedding vectors within the query matrix ⇒ a single matrix computation for the whole sequence
- Multi-head attention
- Allow for attending to different parts of sequence differently each time
- Can capture positional information since each head attends to different segments of the sequence
- Each head captures different contextual info by correlating words uniquely: “allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.”
How?
- Run through attention mechanism several times
- Allows for multiple independent paths to understand input
- Each time mapping an independent set of $Q$, $K$, $V$ matrices into different lower-dimensional spaces, then computing attention there (each output is a “head”)
- Mapping is achieved by multiplying each matrix with a separate weight matrix, denoted as $W_i^Q$, $W_i^K$ and $W_i^V$
- Output vector size is divided by the number of heads, i.e. each head works in a lower dimension $d_{model}/h$
- compensates for the extra complexity of running multiple heads
- Heads are concatenated, then transformed using a square weight matrix $W^O \in \mathbb{R}^{d_{model} \times d_{model}}$, since concatenating $h$ heads of size $d_{model}/h$ restores dimension $d_{model}$
- Parallelism
- Since heads are independent from each other, we can perform the self-attention computation in parallel on different workers
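A minimal sketch using PyTorch's built-in module as a stand-in for the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ and the output matrix $W^O$ (sizes here are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)                    # (batch, tokens, d_model)
# self-attention: the same sequence serves as query, key and value;
# internally each of the 8 heads works in dimension 512 / 8 = 64
out, attn_weights = mha(x, x, x)
```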
Transformer Decoder

- contains all the above components as per Encoder, but has 2 novel parts per below
- output probabilities predict next token in output sentence
- a probability is assigned to each word in the French vocabulary (in the translation example); keep the word with the highest score
Masked Multi-head Attention
- in the decoder, we predict 1 word (token) at a time, sequentially
- self-attention needs to be modified to consider only the output sentence that has been generated so far ⇒ the whole sentence is not known since it hasn't been produced yet
- to disregard unknown words ⇒ must mask the future word embeddings (by setting them to $-\infty$ before the softmax); see the sketch below
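A minimal sketch of the masking step (assuming PyTorch; `scores` stands in for the scaled $QK^T$ matrix):

```python
import torch

t = 4
scores = torch.randn(t, t)                     # stand-in for Q K^T / sqrt(d_k)

# causal mask: position i may only attend to positions <= i;
# future positions get -inf so softmax assigns them zero weight
mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
```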