Attention
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Introduction
- Prerequisite to Transformers
- In NLP, transformers and attention have been utilized successfully for many tasks
- Crucial to understand how attention emerged from NLP, since it is now SoTA for vision too
Memory is attention through time. ~ Alex Graves 2020
The attention mechanism emerged naturally from problems that deal with time-series data (sequences).
Recurrent Neural Networks (RNNs)
- Before attention, Sequence-to-Sequence (Seq2Seq) models were used to handle sequences
- Elements of the sequence are called tokens (can be text, pixels, images, videos)
- ⇒ Goal: Transform input sequence (source) to new sequence (target)
- where the source and target can be of arbitrary (and possibly different) lengths
- RNNs were the dominant choice for such tasks, since there is a preference to treat sequences sequentially

- RNN-based architectures used to work very well with LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) components
- But only for short sequences
Limitations of RNNs
- Only work well for short sequences
- the intermediate representation cannot encode all the information from the inputs ⇒ bottleneck problem
- forgets information from timesteps further behind
- Stacking RNN layers creates the vanishing gradient problem
Core Idea for Attention
- Aims to solve 1) the problem with longer sequences and 2) the vanishing gradient problem, with the idea that: the context vector z should have access to all parts of the input sequence instead of just the last one.
- i.e. form a direct connection with each timestep
Look at all the different words at the same time and learn to “pay attention“ to the correct ones depending on the task.
- Attention is simply a notion of memory, gained from attending to multiple inputs through time.
Types of Attention
- Implicit vs Explicit Attention
Implicit
- DNNs already learn some implicit version of attention
- e.g. focusing on certain parts of the inputs compared to others
- Visualized by looking at partial derivatives with respect to the input
- Ideas for GradCAM?
- Jacobian Matrix
Explicit
- Asking the network to ‘weigh’ its sensitivity to the input based on memory from previous inputs
- Main type of attention meant whenever attention is mentioned
- Hard vs Soft Attention
Hard
- Can be described by discrete variables
- Non-differentiable
- cannot use gradient descent
- Train them using RL techniques such as policy gradients, REINFORCE algorithm
- but they have high variance
- Consider as a switch mechanism to determine whether to attend to a region or not, which means that the function has many abrupt changes over its domain
Soft
- Can be described by continuous variables
- Parametrized by differentiable functions
- The function varies smoothly over its domain and, as a result, it is differentiable (see the sketch after this list)
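A minimal NumPy sketch (toy dimensions and names are my own, not from the original note) contrasting the two: soft attention takes a differentiable weighted average over all positions, while hard attention samples a single position, which is a discrete, non-differentiable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                  # 5 hidden states of dimension 8
scores = rng.normal(size=5)                  # one relevance score per position

# Soft attention: a differentiable weighted average over all positions
weights = np.exp(scores) / np.exp(scores).sum()   # softmax
soft_context = weights @ h                        # shape (8,)

# Hard attention: a discrete choice of one position (non-differentiable)
idx = rng.choice(len(scores), p=weights)
hard_context = h[idx]                             # shape (8,)
```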
Broad categories of Attention Mechanisms
| Name | Definition | Citation |
|---|---|---|
| Self-Attention(&) | Relating different positions of the same input sequence. Theoretically the self-attention can adopt any score functions above, but just replace the target sequence with the same input sequence. | Cheng2016 |
| Global/Soft | Attending to the entire input state space. | Xu2015 |
| Local/Hard | Attending to the part of input state space; i.e. a patch of the input image. | Xu2015; Luong2015 |
(&) Also referred to as “intra-attention” in Cheng et al., 2016 and some other papers.
For now consider global, soft attention.
Original Attention Mechanism Formulation

In the encoder-decoder RNN case, given the previous decoder state $s_{i-1}$ and the encoder hidden states $h_1, \dots, h_T$, we have something like this:
The index $i$ indicates the prediction step. Essentially, we define a score between the hidden state of the decoder and all the hidden states of the encoder.
More specifically, for each encoder hidden state (denoted by $h_j$) we will calculate a scalar $e_{ij}$ (known as an “alignment score” per Bahdanau 2014):
$$e_{ij} = a(s_{i-1}, h_j)$$
Then, to form a probability distribution, and to spread the scores far from each other, $e_{ij}$ is fed into a softmax such that $\alpha_{ij}$ (known as the “weights” per Bahdanau 2014) is obtained:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$
Lastly, the intermediate representation of attention $z_i$ (known as the “context vector” per Bahdanau 2014) is obtained:
$$z_i = \sum_{j=1}^{T} \alpha_{ij} h_j$$
Therefore:
- Attention is defined as a weighted average of values, where the weighting is a learned function
- the weights $\alpha_{ij}$ can be thought of as data-dependent dynamic weights
But this formulation is independent of how the score function $a(\cdot)$ is modelled.
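A minimal NumPy sketch of the equations above (toy dimensions and parameter names are assumptions, not taken from Bahdanau 2014): score each encoder state against the previous decoder state with a small additive network, softmax the scores into weights, and take the weighted average as the context vector.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 6, 16, 16, 32      # sequence length, encoder/decoder/attention dims

h = rng.normal(size=(T, d_h))         # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=d_s)         # previous decoder state s_{i-1}

# trainable parameters of the additive alignment model a(s_{i-1}, h_j)
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)

e = np.tanh(h @ W_h.T + s_prev @ W_s.T) @ v_a     # alignment scores e_{ij}, shape (T,)
alpha = np.exp(e) / np.exp(e).sum()               # attention weights alpha_{ij}
z = alpha @ h                                     # context vector z_i, shape (d_h,)
```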
Types of Attention Mechanism Formulations
| Name | Alignment score function | Citation |
|---|---|---|
| Content-base attention | $\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$, where cosine refers to the cosine similarity metric. | Graves2014 |
| Additive(*) | $\text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$, which refers to the familiar neural network approach with an activation function. | Bahdanau2015 |
| Location-Base | $\alpha_{t,i} = \text{softmax}(W_a s_t)$. Note: this simplifies the softmax alignment to only depend on the target position. | Luong2015 |
| General | $\text{score}(s_t, h_i) = s_t^\top W_a h_i$, where $W_a$ is a trainable weight matrix in the attention layer. | Luong2015 |
| Dot-Product | $\text{score}(s_t, h_i) = s_t^\top h_i$ | Luong2015 |
| Scaled Dot-Product(^) | $\text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}$. Note: very similar to the dot-product attention except for a scaling factor, where $n$ is the dimension of the source hidden state. | Vaswani2017 |

Notes:
- The symbol $s_t$ denotes the decoder state associated with the predictions (previous sections used $s_{i-1}$), while the different $W_a$ and $v_a$ indicate trainable matrices/vectors.
- (*) Referred to as “concat” in Luong, et al., 2015 and as “additive attention” in Vaswani, et al., 2017.
- (^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, which makes efficient learning hard.
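A rough NumPy sketch of a few of the score functions from the table (parameter shapes and names are illustrative assumptions):

```python
import numpy as np

def dot_score(s, h):
    # Luong dot-product: s^T h_i for every position i
    return h @ s                                   # shape (T,)

def scaled_dot_score(s, h):
    # Vaswani scaled dot-product: divide by sqrt of the hidden dimension
    return (h @ s) / np.sqrt(h.shape[-1])

def general_score(s, h, W_a):
    # Luong "general": s^T W_a h_i with a trainable matrix W_a
    return h @ (W_a @ s)

def additive_score(s, h, W_a, v_a):
    # Bahdanau additive / "concat": v_a^T tanh(W_a [s; h_i])
    s_tiled = np.broadcast_to(s, (h.shape[0], s.shape[0]))
    return np.tanh(np.concatenate([s_tiled, h], axis=-1) @ W_a.T) @ v_a
```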
Most popular to date is the Additive(*) formulation by Bahdanau:
- Parametrizes attention as a small fully connected neural network
- Means that attention is a set of trainable weights
- And that it can be tuned with standard backpropagation
But this method has limitations:
- Computational complexity
- the score network has to be evaluated for every input-output pair, i.e. $T_x \times T_y$ times, where $T_x$ and $T_y$ are the lengths of the input and output sentences ⇒ addressed by local attention
Local Attention
- consider only a subset of the input units/tokens
- can also be seen as a form of hard attention, since we first need to take a hard decision to exclude some input units
Note: the attention weights (shown as colours in figures) are constantly changing with each input, while in convolutional and fully connected layers the weights change only slowly, via gradient descent
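A rough sketch of the local idea (window size and names are assumptions, loosely following Luong et al., 2015): only a window of encoder states around a chosen position is scored, and everything outside the window is ignored.

```python
import numpy as np

def local_attention(s, h, p_t, D=2):
    """Attend only to encoder states in a window [p_t - D, p_t + D] around position p_t."""
    T = h.shape[0]
    lo, hi = max(0, p_t - D), min(T, p_t + D + 1)
    window = h[lo:hi]                         # the subset of input states we keep
    e = window @ s                            # dot-product scores inside the window only
    alpha = np.exp(e) / np.exp(e).sum()       # softmax over the window
    return alpha @ window                     # local context vector

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 16))                 # 10 encoder states of dim 16
s = rng.normal(size=16)                       # decoder state
z = local_attention(s, h, p_t=4)              # attends to positions 2..6 only
```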
Self-Attention
- Also referred to as “intra-attention” in Cheng et al., 2016 and some other papers.
- Key component to Transformers (further elaborated there) and Vision Transformers (ViT)
Main Idea
- Define attention over the same sequence
- Instead of input-output sequence association, look for scores between elements of the sequence.
- Can be regarded as a (k-vertex) connected undirected weighted graph. Undirected indicates that the matrix is symmetric.

- scores between elements of the same sequence, $e_{ij} = a(h_i, h_j)$, instead of the previous $a(s_{i-1}, h_j)$
- Can be computed in any trainable way
- End goal: create a meaningful representation of the sequence before transforming it to another (see the sketch below)
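A minimal sketch of the idea (plain scaled dot-product between the elements of the same sequence; the learned query/key/value projections used in Transformers are covered in that note):

```python
import numpy as np

def self_attention(x):
    """Each element of the sequence attends to every element of the same sequence."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                     # pairwise scores e_{ij}, shape (T, T)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x                                # new representation, same shape as x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))       # a sequence of 5 tokens with dimension 8
y = self_attention(x)             # each token is now a weighted mix of all tokens
```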
Conclusion
Advantages of Attention
- Usually eliminates the vanishing gradient problem, as it provides direct connections between the encoder states and the decoder.
- Conceptually, it acts similarly to skip connections in CNNs.
- Explainability
- By inspecting the distribution of attention weights, we can gain insights into the behavior of the model, as well as to understand its limitations.
Theoretical References
Papers
Articles
- How Attention works in Deep Learning: understanding the attention mechanism in sequence models | AI Summer
- Lilian Weng - Attention? Attention!
- Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs
Courses
- DeepMind’s deep learning videos 2020 with UCL: Attention and Memory in Deep Learning, Alex Graves
