RetNet - Retentive Networks


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


  • [2307.08621] Retentive Network: A Successor to Transformer for Large Language Models
  • The Rise of RNN? Review of “Retentive Network: A Successor to Transformer for Large Language Models” | by Sehyun Choi | Medium
  • Retentive Networks (RetNet) Explained: The much-awaited Transformers-killer is here | by Shantanu Chandra | AI FUSION LABS | Aug, 2023 | Medium

vs transformers

  • lower memory consumption
  • higher throughput
  • lower latency

key features

  • parallel training
    • by removing softmax and replacing it with a D-matrix and “GroupNorm”
      • why does the softmax have to go? its normalization sums over all previous tokens at once, so softmax attention has no equivalent fixed-size recurrent form; removing it lets the same layer be trained in parallel and run recurrently
    • softmax typically serves 2 functions:
      • weight different time steps differently → replaced by the D-matrix
      • introduce non-linearity → replaced by GroupNorm
  • recurrent inference (the parallel and recurrent forms produce identical outputs; see the sketch after this list)
  • chunkwise recurrent form: parallel within each chunk, recurrent across chunks, useful for long sequences
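
A minimal numpy sketch of this equivalence (my own toy version, not the paper's implementation; the $e^{in\theta}$ rotation, scaling, and GroupNorm are omitted, and the sizes and $\gamma$ here are made up):

```python
import numpy as np

T, d = 5, 4           # toy sequence length and head dimension
gamma = 0.9           # decay factor
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# parallel form (training): (Q K^T ⊙ D) V, with D_{nm} = gamma^{n-m} for n >= m, else 0
pos = np.arange(T)
D = np.where(pos[:, None] >= pos[None, :], gamma ** (pos[:, None] - pos[None, :]), 0.0)
out_parallel = (Q @ K.T * D) @ V

# recurrent form (inference): carry a fixed-size state S instead of attending to all past tokens
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for n in range(T):
    S = gamma * S + np.outer(K[n], V[n])   # S_n = gamma * S_{n-1} + k_n^T v_n
    out_recurrent[n] = Q[n] @ S            # o_n = q_n S_n

assert np.allclose(out_parallel, out_recurrent)   # both forms give identical outputs
```

The chunkwise form mixes the two: the parallel formula within each chunk, plus a recurrent state carried across chunk boundaries.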

retention mechanism

  • the main difference vs. the normal self-attention formula is the “position” matrix $A^{n-m}$
    • normal transformers also encode position, via positional embeddings:
      • absolute embedding (the common choice in transformer papers)
        • encodes no relationship between positions
        • can be applied directly to the inputs
      • relative embedding
        • the attention matrix encodes the relative position between every pair of tokens
        • applied to the attention matrix itself (not possible to apply at the input stage, i.e. to the Q, K, V matrices)
        • but more computationally expensive
      • rotary embedding
        • represents position as a rotation angle, so the dot product depends only on the relative position
        • supposed to be more computationally efficient than relative embeddings
        • RetNet uses this! $e^{in\theta}$
  • D matrix
    • a lower-triangular matrix whose entries $\gamma^{n-m}$ apply an exponential decay to tokens based on their distance from the current position (see the sketch after this list)
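
A simplified sketch of how the position term $A^{n-m}$ is handled in the parallel form (my reading, not the official code): the decay $\gamma^{n-m}$ goes into the D matrix, and the rotation $e^{i(n-m)\theta}$ is folded into Q and K. A single scalar $\theta$ and a complex-valued dot product are used here for illustration; the paper uses per-dimension angles and a real-valued xPos-style implementation.

```python
import numpy as np

T, d = 6, 4
gamma, theta = 0.9, 0.3                 # toy decay factor and rotation angle
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

pos = np.arange(T)
rot = np.exp(1j * pos * theta)          # e^{in*theta}, one phase per position
Qr = Q * rot[:, None]                   # rotate queries by +n*theta
Kr = K * np.conj(rot)[:, None]          # rotate keys by -m*theta

# the phases combine into e^{i(n-m)*theta}: only the *relative* position survives
scores = np.real(Qr @ Kr.T)             # = (Q_n . K_m) * cos((n - m) * theta)

# D supplies both the gamma^{n-m} discount and the causal mask (zeros above the diagonal)
D = np.where(pos[:, None] >= pos[None, :], gamma ** (pos[:, None] - pos[None, :]), 0.0)

retention_out = (scores * D) @ V        # parallel retention output, shape (T, d)
```

In the full multi-scale retention layer, each head gets its own $\gamma$ and the head outputs are normalized with GroupNorm.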

purpose of the position-aware context

  • acts as a discount factor on older tokens, tells the model where the current token sits relative to the others, and provides causal masking (see the worked example below)
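
Concretely, for a length-4 sequence the D matrix is:

$$
D =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
\gamma & 1 & 0 & 0 \\
\gamma^2 & \gamma & 1 & 0 \\
\gamma^3 & \gamma^2 & \gamma & 1
\end{pmatrix},
\qquad
D_{nm} =
\begin{cases}
\gamma^{\,n-m}, & n \ge m \\
0, & n < m
\end{cases}
$$

The zeros above the diagonal are the causal mask, and the increasing powers of $\gamma$ down each column are the discount on older tokens.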

how does it compare against distilled / compressed versions of transformers?