RetNet - Retentive Networks
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
[2307.08621] Retentive Network: A Successor to Transformer for Large Language Models
The Rise of RNN? Review of “Retentive Network: A Successor to Transformer for Large Language Models” | by Sehyun Choi | Medium
Retentive Networks (RetNet) Explained: The much-awaited Transformers-killer is here | by Shantanu Chandra | AI FUSION LABS | Aug, 2023 | Medium
vs transformers
- lower memory consumption
- higher throughput
- lower latency
key features
- parallel training
- achieved by removing softmax and replacing it with a D-matrix and “GroupNorm”
- why does softmax have to go? its normalisation ties every position to all previous ones, which blocks an equivalent recurrent form
- softmax typically serves 2 functions:
- weight different time steps differently ⇒ D-matrix
- introduce non-linearity ⇒ GroupNorm
- recurrent inference (O(1) per-token cost at decode time)
- chunkwise inference (parallel within a chunk, recurrent across chunks; see the sketch below)
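A minimal numpy sketch (my own, not the paper’s reference code) of single-head retention in its three equivalent forms: the parallel form used for training, the recurrent form used for cheap decoding, and the chunkwise form. The e^{in\theta} rotation and the GroupNorm that replaces softmax’s non-linearity are omitted for brevity; T, d, B and gamma are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 12, 4          # sequence length, head dimension
gamma = 0.9           # decay factor (one per head in the paper)

Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# ---- parallel form (training): Retention = (Q K^T ⊙ D) V ----
n = np.arange(T)
D = np.where(n[:, None] >= n[None, :], gamma ** (n[:, None] - n[None, :]), 0.0)
out_parallel = (Q @ K.T * D) @ V

# ---- recurrent form (inference): S_n = γ S_{n-1} + k_n^T v_n, o_n = q_n S_n ----
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

# ---- chunkwise form: parallel inside each chunk, recurrent across chunks ----
B = 4                                   # chunk size
Dc = D[:B, :B]                          # within-chunk decay mask
out_chunk = np.zeros((T, d))
S = np.zeros((d, d))
for c in range(T // B):
    q, k, v = Q[c*B:(c+1)*B], K[c*B:(c+1)*B], V[c*B:(c+1)*B]
    inner = (q @ k.T * Dc) @ v                                # within-chunk part
    cross = (gamma ** (np.arange(B) + 1))[:, None] * (q @ S)  # carry-over from past chunks
    out_chunk[c*B:(c+1)*B] = inner + cross
    decay = (gamma ** (B - 1 - np.arange(B)))[:, None]
    S = gamma ** B * S + k.T @ (v * decay)                    # update cross-chunk state

assert np.allclose(out_parallel, out_recurrent)
assert np.allclose(out_parallel, out_chunk)
print("parallel, recurrent and chunkwise retention agree")
```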
retention mechanism
- main difference vs the normal self-attention formula is the “position” matrix A^(n-m)
- positional embeddings already exist in normal transformers; the usual variants:
- absolute embedding (normal, common transformer papers)
- no relationship encoded between positions
- can be added directly to the inputs
- relative embedding
- the attention matrix itself carries the relative position between each pair of tokens
- applied to the matrix itself (cannot be applied at the input stage, i.e. to the Q/K/V matrices)
- but computationally expensive
- rotary embedding
- represents position as a rotation angle, rather than an explicit relative offset
- supposed to be more computationally efficient than relative embeddings
- RetNet uses this! e^{in\theta}
- D matrix
- lower-triangular matrix; \gamma gives an exponential decay of past tokens based on their distance from the current position (see the sketch below)
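A small numpy sketch (illustrative values for \theta, \gamma and the length, not the paper’s code) of how the “position” matrix A^(n-m) is realised: diagonalising A gives a decay \gamma and a rotation e^{i\theta}, so giving q_n and k_m the phases e^{in\theta} and e^{im\theta} leaves only the relative phase e^{i(n-m)\theta} in their product, while the lower-triangular D supplies the \gamma^{n-m} decay and the causal mask.

```python
import numpy as np

T, gamma, theta = 6, 0.9, 0.3
n = np.arange(T)

# position n contributes the phase e^{i n theta} (content vectors set to 1 here)
q_phase = np.exp(1j * n * theta)
k_phase = np.exp(1j * n * theta)

# q_n * conj(k_m) leaves only the relative phase e^{i (n-m) theta}
rel_phase = np.outer(q_phase, np.conj(k_phase))

# D: causal mask plus exponential decay of older tokens
D = np.where(n[:, None] >= n[None, :], gamma ** (n[:, None] - n[None, :]), 0.0)

# combined position weight between n and m: gamma^{n-m} e^{i (n-m) theta} for m <= n
A_pow = rel_phase * D

# check against the direct definition A^{n-m} with A = gamma * e^{i theta}
A = gamma * np.exp(1j * theta)
direct = np.where(n[:, None] >= n[None, :], A ** (n[:, None] - n[None, :]), 0.0)
assert np.allclose(A_pow, direct)
print("rotation + decay mask reproduce A^(n-m)")
```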
purpose of position-aware context
- supplies the discount factor, tells the model where the current token sits relative to the past, and does the causal masking
how does it compare against those distilled / compressed versions of transformers?