Attention


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

  • Prerequisite to Transformers

  • In NLP, transformers and attention have been utilized successfully for many tasks

    • Crucial to understand how attention emerged from NLP, since it is now SoTA for vision too

    Memory is attention through time. ~ Alex Graves 2020

The attention mechanism emerged naturally from problems that deal with time-series data (sequences).

Recurrent Neural Networks (RNNs)

  • Before attention, Sequence to Sequence (Seq2Seq) models were used to handle sequences
  • Elements of the sequence are called tokens (can be text, pixels, images, videos)
  • Goal: transform an input sequence (source) into a new sequence (target)
    • where the source and target can have arbitrary (and different) lengths
  • RNNs were the kings of such tasks, since there is a preference to treat sequences sequentially
  • RNN-based architectures used to work very well with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) components
  • But only for small sequences
Limitations of RNNs
  • Only work well for small sequences
    • the intermediate representation cannot encode all the information from the inputs: the bottleneck problem (see the sketch after this list)
    • forget information from timesteps further behind
  • Stacked RNN layers create the vanishing gradient problem
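A minimal sketch of this bottleneck, assuming a plain tanh RNN encoder written in NumPy (names such as `rnn_encoder`, `W_xh`, `W_hh` are illustrative, not from any library): no matter how long the input is, the decoder only ever receives the single final hidden state.

```python
# Minimal sketch (illustrative, not any particular framework's API): a vanilla
# tanh RNN encoder that compresses the whole input sequence into its last
# hidden state -- exactly the bottleneck described above.
import numpy as np

def rnn_encoder(x_seq, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over the sequence and return only the last state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in x_seq:                      # process tokens one by one
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                               # single fixed-size context vector

rng = np.random.default_rng(0)
d_in, d_h, n = 8, 16, 50                   # long-ish input sequence
x_seq = rng.normal(size=(n, d_in))
W_xh = rng.normal(size=(d_h, d_in)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1
b_h = np.zeros(d_h)

z = rnn_encoder(x_seq, W_xh, W_hh, b_h)
print(z.shape)   # (16,) -- all 50 timesteps squeezed into one vector
```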

Core Idea for Attention

  • Aim to solve 1) the problem with longer sequences and 2) the vanishing gradient problem, with the idea that: the context vector $z$ should have access to all parts of the input sequence instead of just the last one.
  • i.e. form a direct connection with each timestep

Look at all the different words at the same time and learn to “pay attention” to the correct ones depending on the task.

  • Attention is simply a notion of memory, gained from attending to multiple inputs through time.

Types of Attention

  1. Implicit vs Explicit Attention
    Implicit
    • DNNs already learn some implicit version of attention
    • e.g. focusing on certain parts of the inputs compared to others
    • Visualized by looking at partial derivatives with respect to the input
      • Ideas for GradCAM?
      • Jacobian Matrix
    Explicit
    • Asking the network to ‘weigh’ its sensitivity to the input based on memory from previous inputs
    • Main type of attention whenever it is mentioned
  2. Hard vs Soft Attention
    Hard
    • Can be described by discrete variables
    • Non-differentiable
      • cannot use gradient descent
    • Train them using RL techniques such as policy gradients (e.g. the REINFORCE algorithm)
      • but these estimators have high variance
    • Can be considered as a switch mechanism that determines whether to attend to a region or not, which means that the function has many abrupt changes over its domain
    Soft
    • Can be described by continuous variables
    • Parametrized by differentiable functions
      • The function varies smoothly over its domain and, as a result, it is differentiable
    Given that the full sequence is available, we can mainly consider soft attention, so that we can use backprop (the two are contrasted in the sketch below).
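A toy sketch contrasting the two on one set of alignment scores (values and names are illustrative): soft attention mixes all values with softmax weights and stays differentiable, while hard attention makes a discrete argmax choice.

```python
# Minimal sketch contrasting soft and hard attention over the same scores.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([1.2, -0.3, 0.7, 2.5])                 # alignment scores e_ij
values = np.random.default_rng(0).normal(size=(4, 8))    # hidden states h_j

# Soft attention: smooth weighting of all values (gradients flow to every input).
soft_weights = softmax(scores)
soft_context = soft_weights @ values

# Hard attention: a discrete choice of one value (argmax is not differentiable,
# so training needs RL-style estimators such as REINFORCE).
hard_choice = scores.argmax()
hard_context = values[hard_choice]

print(soft_weights.round(3), hard_choice)
```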
Broad categories of Attention Mechanisms
| Name | Definition | Citation |
| --- | --- | --- |
| Self-Attention (&) | Relating different positions of the same input sequence. Theoretically, self-attention can adopt any of the score functions listed in the table below, just replacing the target sequence with the same input sequence. | Cheng2016 |
| Global / Soft | Attending to the entire input state space. | Xu2015 |
| Local / Hard | Attending to part of the input state space, i.e. a patch of the input image. | Xu2015; Luong2015 |

(&) Also referred to as “intra-attention” in Cheng et al., 2016 and some other papers.

For now consider global, soft attention.

Original Attention Mechanism Formulation

In the encoder-decoder RNN case, given the previous decoder state $y_{i-1}$ and the encoder hidden states $h = (h_1, h_2, \dots, h_n)$, we have something like this:

$$e_{ij} = f_{\text{att}}(y_{i-1}, h_j)$$

The index $i$ indicates the prediction step. Essentially, we define a score between the hidden state of the decoder and all the hidden states of the encoder.

More specifically, for each encoder hidden state $h_j$ we will calculate a scalar $e_{ij}$ (known as an “alignment score” per Bahdanau 2014), as above. Then, to form a probability distribution and to push the scores further apart, $e_{ij}$ is passed into a softmax so that the weights $\alpha_{ij}$ (the “attention weights” per Bahdanau 2014) are obtained:

$$\alpha_{ij} = \operatorname{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$

Lastly, the intermediate representation of attention (known as the “context vector” per Bahdanau 2014) is obtained:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} h_j$$

Therefore:

  • Attention is defined as a weighted average of values, where the weighting is a learned function
  • The weights $\alpha_{ij}$ can be thought of as data-dependent dynamic weights

But this formulation is independent of the choice of how the score $e_{ij}$ is modelled; a sketch of one full attention step follows.
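A hedged sketch of one decoding step following the recipe above (alignment scores, softmax, context vector), assuming an additive-style score function; the parameter names `W_y`, `W_h`, `v` are illustrative, not Bahdanau's exact parametrisation.

```python
# Sketch of one attention step: e_ij -> softmax -> context vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d_h, d_y = 6, 16, 16                     # encoder length, encoder/decoder dims
H = rng.normal(size=(n, d_h))               # encoder hidden states h_1..h_n
y_prev = rng.normal(size=d_y)               # previous decoder state y_{i-1}

d_a = 32                                    # attention layer size (illustrative)
W_y = rng.normal(size=(d_a, d_y)) * 0.1
W_h = rng.normal(size=(d_a, d_h)) * 0.1
v = rng.normal(size=d_a) * 0.1

# Alignment scores e_ij from a small additive-style network
e = np.array([v @ np.tanh(W_y @ y_prev + W_h @ h_j) for h_j in H])
alpha = softmax(e)                          # attention weights alpha_ij
z = alpha @ H                               # context vector z_i

print(alpha.round(3), z.shape)              # weights sum to 1, z has shape (16,)
```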

Types of Attention Mechanism Formulations

| Name | Alignment score function | Citation |
| --- | --- | --- |
| Content-base attention | $\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$, where cosine refers to the cosine similarity metric. | Graves2014 |
| Additive (*) | $\text{score}(s_t, h_i) = v_a^{\top} \tanh(W_a [s_t ; h_i])$, i.e. the familiar neural network approach with an activation function. | Bahdanau2015 |
| Location-Base | $\alpha_{t,i} = \operatorname{softmax}(W_a s_t)$. Note: this simplifies the softmax alignment to depend only on the target position. | Luong2015 |
| General | $\text{score}(s_t, h_i) = s_t^{\top} W_a h_i$, where $W_a$ is a trainable weight matrix in the attention layer. | Luong2015 |
| Dot-Product | $\text{score}(s_t, h_i) = s_t^{\top} h_i$ | Luong2015 |
| Scaled Dot-Product (^) | $\text{score}(s_t, h_i) = \dfrac{s_t^{\top} h_i}{\sqrt{n}}$. Note: very similar to dot-product attention except for a scaling factor, where $n$ is the dimension of the source hidden state. | Vaswani2017 |
Notes:
  • The symbol $s_t$ denotes the decoder state at prediction step $t$ (previous sections used $y_{i-1}$ and $h_j$), while $W_a$ and $v_a$ indicate trainable matrices/vectors.
  • (*) Referred to as “concat” in Luong, et al., 2015 and as “additive attention” in Vaswani, et al., 2017.
  • (^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning hard.
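A sketch of the score functions in the table for a single pair $(s_t, h_i)$, assuming matching source/target dimensions and illustrative weight shapes:

```python
# Sketch of the alignment score functions above for one pair (s, h).
import numpy as np

rng = np.random.default_rng(0)
d = 16
s, h = rng.normal(size=d), rng.normal(size=d)
W_a = rng.normal(size=(d, d)) * 0.1          # trainable matrix (general attention)
W_c = rng.normal(size=(d, 2 * d)) * 0.1      # trainable matrix (additive attention)
v_a = rng.normal(size=d) * 0.1               # trainable vector (additive attention)

content_base = (s @ h) / (np.linalg.norm(s) * np.linalg.norm(h))  # cosine similarity
additive     = v_a @ np.tanh(W_c @ np.concatenate([s, h]))        # Bahdanau "concat"
general      = s @ W_a @ h                                        # Luong general
dot_product  = s @ h                                              # Luong dot-product
scaled_dot   = (s @ h) / np.sqrt(d)                               # Vaswani scaled dot-product

print(content_base, additive, general, dot_product, scaled_dot)
```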

Most popular until now is the Additive (*) attention by Bahdanau:

  • Parametrizes attention as a small fully connected neural network
  • Means that attention is a set of trainable weights
  • And that it can be tuned with standard backpropagation

But this method has limitations:

  • Computational complexity
    • the attention NN must be evaluated for every pair of input and output positions, i.e. roughly $n \times m$ alignment scores, where $n$ and $m$ are the lengths of the input and output sentences; this is addressed by local attention

Local Attention

  • consider only a subset (window) of the input units/tokens (see the sketch after this list)
  • can also be seen as a form of hard attention, since we first need to take a hard decision to exclude some input units
  • Note (from the accompanying figure): colors of the attention weights indicate that they are constantly changing, while weights in convolutional and fully connected layers change slowly by gradient descent
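A sketch of local attention assuming a simple fixed window of half-width $D$ around a position $p_t$ (not Luong's predictive alignment): positions outside the window are masked out before the softmax, so only a subset of tokens receives weight.

```python
# Sketch of local (windowed) attention: only positions inside [p_t - D, p_t + D]
# get non-zero weights; everything else is masked out before the softmax.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 12, 8
H = rng.normal(size=(n, d))                 # encoder hidden states
s = rng.normal(size=d)                      # current decoder state

p_t, D = 5, 2                               # window centre and half-width (illustrative)
scores = H @ s                              # dot-product scores over all positions
mask = np.full(n, -np.inf)
mask[max(0, p_t - D): p_t + D + 1] = 0.0    # keep only the local window
alpha = softmax(scores + mask)              # weights outside the window become 0
context = alpha @ H

print(alpha.round(3))
```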

Self-Attention

Main Idea

  • Define attention over the same sequence
  • Instead of an input-output sequence association, look for scores between elements of the sequence itself
  • Can be regarded as a (k-vertex) connected undirected weighted graph; undirected indicates that the score matrix is symmetric
  • The score becomes $e_{ij} = f_{\text{att}}(h_i, h_j)$, instead of the previous $e_{ij} = f_{\text{att}}(y_{i-1}, h_j)$
  • It can be computed in any trainable way (see the sketch after this list)
  • End goal: create a meaningful representation of the sequence before transforming it into another
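A sketch of self-attention within one sequence, assuming plain dot-product scores: every element attends to every other element of the same sequence, yielding an $n \times n$ score matrix (the weighted graph above) and a new representation of the same length.

```python
# Sketch of self-attention: scores between all pairs of elements of one sequence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                 # one input sequence of n tokens

E = X @ X.T                                 # symmetric score matrix e_ij (undirected graph)
A = softmax(E, axis=-1)                     # row-wise attention weights
Z = A @ X                                   # new representation of the sequence

print(A.shape, Z.shape)                     # (5, 5) and (5, 8)
```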

Conclusion

Advantages of Attention

  • Usually eliminates the vanishing gradient problem, since attention provides direct connections between the encoder states and the decoder.
  • Explainability
    • By inspecting the distribution of attention weights, we can gain insights into the behavior of the model, as well as to understand its limitations.

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks