Attention
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Introduction
- Prerequisite to Transformers
- In NLP, transformers and attention have been utilized successfully for many tasks
- Crucial to understand how attention emerged from NLP, since it is now SoTA for vision too
Memory is attention through time. ~ Alex Graves 2020
The attention mechanism emerged naturally from problems that deal with time-series data (sequences).
Recurrent Neural Networks (RNNs)
- Before attention, Sequence-to-Sequence (Seq2Seq) models were used to handle sequences
- Elements of the sequence are called tokens (can be text, pixels, images, videos)
- ⇒ Goal: Transform input sequence (source) to new sequence (target)
- where the source and target can be of arbitrary (and possibly different) lengths
- RNNs were the dominant choice for such tasks, since there is a preference to treat sequences sequentially

- RNN-based architectures used to work very well with LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) components
- But only for short sequences
Limitations of RNNs
- Only work well for short sequences
- the intermediate representation cannot encode all the information from the inputs ⇒ bottleneck problem
- forgets information from timesteps further behind
- Stacking RNN layers creates the vanishing gradient problem
Core Idea for Attention
- Aims to solve 1) the problem with longer sequences and 2) the vanishing gradient problem, with the idea that: the context vector z should have access to all parts of the input sequence instead of just the last one.
- i.e. form a direct connection with each timestep
Look at all the different words at the same time and learn to “pay attention“ to the correct ones depending on the task.
- Attention is simply a notion of memory, gained from attending to multiple inputs through time.
Types of Attention
- Implicit vs Explicit Attention
Implicit
- DNNs already learn some implicit version of attention
- e.g. focusing on certain parts of the inputs compared to others
- Visualized by looking at partial derivatives with respect to the input
- Ideas for GradCAM?
- Jacobian Matrix
Explicit
- Asking the network to ‘weigh’ its sensitivity to the input based on memory from previous inputs
- Main type of attention meant whenever attention is mentioned
- Hard vs Soft Attention
Hard
- Can be described by discrete variables
- Non-differentiable
- cannot use gradient descent
- Train them using RL techniques such as policy gradients, REINFORCE algorithm
- but they have high variance
- Consider as a switch mechanism to determine whether to attend to a region or not, which means that the function has many abrupt changes over its domain
Soft
- Can be described by continuous variables
- Parametrized by differentiable functions
- The function varies smoothly over its domain and, as a result, it is differentiable (see the sketch after this list)
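A minimal NumPy sketch (toy dimensions and names are my own, not from the original note) contrasting the two: soft attention takes a differentiable weighted average over all positions, while hard attention samples a single position, which is a discrete, non-differentiable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                  # 5 hidden states of dimension 8
scores = rng.normal(size=5)                  # one relevance score per position

# Soft attention: a differentiable weighted average over all positions
weights = np.exp(scores) / np.exp(scores).sum()   # softmax
soft_context = weights @ h                        # shape (8,)

# Hard attention: a discrete choice of one position (non-differentiable)
idx = rng.choice(len(scores), p=weights)
hard_context = h[idx]                             # shape (8,)
```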
Broad categories of Attention Mechanisms
| Name | Definition | Citation |
|---|---|---|
| Self-Attention(&) | Relating different positions of the same input sequence. Theoretically the self-attention can adopt any score functions above, but just replace the target sequence with the same input sequence. | Cheng2016 |
| Global/Soft | Attending to the entire input state space. | Xu2015 |
| Local/Hard | Attending to the part of input state space; i.e. a patch of the input image. | Xu2015; Luong2015 |
(&) Also referred to as “intra-attention” in Cheng et al., 2016 and some other papers.
For now consider global, soft attention.
Original Attention Mechanism Formulation

In the encoder-decoder RNN case, given the previous decoder state $s_{i-1}$ and the encoder hidden states $h_1, \dots, h_T$, we have something like this:
The index $i$ indicates the prediction step. Essentially, we define a score between the hidden state of the decoder and all the hidden states of the encoder.
More specifically, for each encoder hidden state (denoted by $h_j$) we will calculate a scalar $e_{ij}$ (known as an “alignment score” per Bahdanau 2014):
$$e_{ij} = a(s_{i-1}, h_j)$$
Then, to form a probability distribution, and to spread the scores far from each other, $e_{ij}$ is fed into a softmax such that $\alpha_{ij}$ (known as the “weights” per Bahdanau 2014) is obtained:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$
Lastly, the intermediate representation of attention $z_i$ (known as the “context vector” per Bahdanau 2014) is obtained:
$$z_i = \sum_{j=1}^{T} \alpha_{ij} h_j$$
Therefore:
- Attention is defined as a weighted average of values, where the weighting is a learned function
- the weights $\alpha_{ij}$ can be thought of as data-dependent dynamic weights
But this formulation is independent of how the score function $a(\cdot)$ is modelled.
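A minimal NumPy sketch of the equations above (toy dimensions and parameter names are assumptions, not taken from Bahdanau 2014): score each encoder state against the previous decoder state with a small additive network, softmax the scores into weights, and take the weighted average as the context vector.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 6, 16, 16, 32      # sequence length, encoder/decoder/attention dims

h = rng.normal(size=(T, d_h))         # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=d_s)         # previous decoder state s_{i-1}

# trainable parameters of the additive alignment model a(s_{i-1}, h_j)
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)

e = np.tanh(h @ W_h.T + s_prev @ W_s.T) @ v_a     # alignment scores e_{ij}, shape (T,)
alpha = np.exp(e) / np.exp(e).sum()               # attention weights alpha_{ij}
z = alpha @ h                                     # context vector z_i, shape (d_h,)
```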
Types of Attention Mechanism Formulations
| Name | Alignment score function | Citation |
|---|---|---|
| Content-base attention | $\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$, where cosine refers to the cosine similarity metric. | Graves2014 |
| Additive(*) | $\text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$, which refers to the familiar neural network approach with an activation function. | Bahdanau2015 |
| Location-Base | $\alpha_{t,i} = \text{softmax}(W_a s_t)$. Note: this simplifies the softmax alignment to only depend on the target position. | Luong2015 |
| General | $\text{score}(s_t, h_i) = s_t^\top W_a h_i$, where $W_a$ is a trainable weight matrix in the attention layer. | Luong2015 |
| Dot-Product | $\text{score}(s_t, h_i) = s_t^\top h_i$ | Luong2015 |
| Scaled Dot-Product(^) | $\text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}$. Note: very similar to the dot-product attention except for a scaling factor, where $n$ is the dimension of the source hidden state. | Vaswani2017 |

Notes:
- The symbol $s_t$ denotes the decoder state associated with the predictions (previous sections used $s_{i-1}$), while the different $W_a$ and $v_a$ indicate trainable matrices/vectors.
- (*) Referred to as “concat” in Luong, et al., 2015 and as “additive attention” in Vaswani, et al., 2017.
- (^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, which makes efficient learning hard.
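A rough NumPy sketch of a few of the score functions from the table (parameter shapes and names are illustrative assumptions):

```python
import numpy as np

def dot_score(s, h):
    # Luong dot-product: s^T h_i for every position i
    return h @ s                                   # shape (T,)

def scaled_dot_score(s, h):
    # Vaswani scaled dot-product: divide by sqrt of the hidden dimension
    return (h @ s) / np.sqrt(h.shape[-1])

def general_score(s, h, W_a):
    # Luong "general": s^T W_a h_i with a trainable matrix W_a
    return h @ (W_a @ s)

def additive_score(s, h, W_a, v_a):
    # Bahdanau additive / "concat": v_a^T tanh(W_a [s; h_i])
    s_tiled = np.broadcast_to(s, (h.shape[0], s.shape[0]))
    return np.tanh(np.concatenate([s_tiled, h], axis=-1) @ W_a.T) @ v_a
```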
Most popular to date is the Additive(*) formulation by Bahdanau:
- Parametrizes attention as a small fully connected neural network
- Means that attention is a set of trainable weights
- And that it can be tuned with standard backpropagation
But this method has limitations:
- Computational complexity
- the score network has to be evaluated for every input-output pair, i.e. $T_x \times T_y$ times, where $T_x$ and $T_y$ are the lengths of the input and output sentences ⇒ addressed by local attention
Local Attention
- consider only a subset of the input units/tokens
- can also be seen as a form of hard attention, since we first need to take a hard decision to exclude some input units
Note: the attention weights (shown as colours in figures) are constantly changing with each input, while in convolutional and fully connected layers the weights change only slowly, via gradient descent
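A rough sketch of the local idea (window size and names are assumptions, loosely following Luong et al., 2015): only a window of encoder states around a chosen position is scored, and everything outside the window is ignored.

```python
import numpy as np

def local_attention(s, h, p_t, D=2):
    """Attend only to encoder states in a window [p_t - D, p_t + D] around position p_t."""
    T = h.shape[0]
    lo, hi = max(0, p_t - D), min(T, p_t + D + 1)
    window = h[lo:hi]                         # the subset of input states we keep
    e = window @ s                            # dot-product scores inside the window only
    alpha = np.exp(e) / np.exp(e).sum()       # softmax over the window
    return alpha @ window                     # local context vector

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 16))                 # 10 encoder states of dim 16
s = rng.normal(size=16)                       # decoder state
z = local_attention(s, h, p_t=4)              # attends to positions 2..6 only
```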
Self-Attention
- Also referred to as “intra-attention” in Cheng et al., 2016 and some other papers.
- Key component to Transformers (further elaborated there) and Vision Transformers (ViT)
Main Idea
- Define attention over the same sequence
- Instead of input-output sequence association, look for scores between elements of the sequence.
- Can be regarded as a (k-vertex) connected undirected weighted graph. Undirected indicates that the matrix is symmetric.

- scores between elements of the same sequence, $e_{ij} = a(h_i, h_j)$, instead of the previous $a(s_{i-1}, h_j)$
- Can be computed in any trainable way
- End goal: create a meaningful representation of the sequence before transforming it to another (see the sketch below)
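A minimal sketch of the idea (plain scaled dot-product between the elements of the same sequence; the learned query/key/value projections used in Transformers are covered in that note):

```python
import numpy as np

def self_attention(x):
    """Each element of the sequence attends to every element of the same sequence."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                     # pairwise scores e_{ij}, shape (T, T)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x                                # new representation, same shape as x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))       # a sequence of 5 tokens with dimension 8
y = self_attention(x)             # each token is now a weighted mix of all tokens
```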
Conclusion
Advantages of Attention
- Usually eliminates the vanishing gradient problem, as it provides direct connections between the encoder states and the decoder.
- Conceptually, it acts similarly to skip connections in CNNs.
- Explainability
- By inspecting the distribution of attention weights, we can gain insights into the behavior of the model, as well as to understand its limitations.
Theoretical References
Papers
Articles
- How Attention works in Deep Learning: understanding the attention mechanism in sequence models | AI Summer
- Lilian Weng - Attention? Attention!
- Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs
Courses
- DeepMind’s deep learning videos 2020 with UCL: Attention and Memory in Deep Learning, Alex Graves
