RetNet - Retentive Networks


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


  • [2307.08621] Retentive Network: A Successor to Transformer for Large Language Models
  • The Rise of RNN? Review of “Retentive Network: A Successor to Transformer for Large Language Models” | by Sehyun Choi | Medium
  • Retentive Networks (RetNet) Explained: The much-awaited Transformers-killer is here | by Shantanu Chandra | AI FUSION LABS | Aug, 2023 | Medium

vs transformers

  • lower memory consumption
  • higher throughput
  • lower latency

key features

  • parallel training
    • by removing softmax and replacing it with a D-matrix and “GroupNorm”
      • why does the softmax have to go? its normalization sums over all previous tokens at once, so softmax attention has no equivalent fixed-size recurrent form; removing it lets the same layer be trained in parallel and run recurrently
    • softmax typically serves 2 functions:
      • weight different time steps differently → replaced by the D-matrix
      • introduce non-linearity → replaced by GroupNorm
  • recurrent inference (the parallel and recurrent forms produce identical outputs; see the sketch after this list)
  • chunkwise recurrent form: parallel within each chunk, recurrent across chunks, useful for long sequences
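
A minimal numpy sketch of this equivalence (my own toy version, not the paper's implementation; the $e^{in\theta}$ rotation, scaling, and GroupNorm are omitted, and the sizes and $\gamma$ here are made up):

```python
import numpy as np

T, d = 5, 4           # toy sequence length and head dimension
gamma = 0.9           # decay factor
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# parallel form (training): (Q K^T ⊙ D) V, with D_{nm} = gamma^{n-m} for n >= m, else 0
pos = np.arange(T)
D = np.where(pos[:, None] >= pos[None, :], gamma ** (pos[:, None] - pos[None, :]), 0.0)
out_parallel = (Q @ K.T * D) @ V

# recurrent form (inference): carry a fixed-size state S instead of attending to all past tokens
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for n in range(T):
    S = gamma * S + np.outer(K[n], V[n])   # S_n = gamma * S_{n-1} + k_n^T v_n
    out_recurrent[n] = Q[n] @ S            # o_n = q_n S_n

assert np.allclose(out_parallel, out_recurrent)   # both forms give identical outputs
```

The chunkwise form mixes the two: the parallel formula within each chunk, plus a recurrent state carried across chunk boundaries.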

retention mechanism

  • the main difference vs. the normal self-attention formula is the “position” matrix $A^{n-m}$
    • normal transformers also encode position, via positional embeddings:
      • absolute embedding (the common choice in transformer papers)
        • encodes no relationship between positions
        • can be applied directly to the inputs
      • relative embedding
        • the attention matrix encodes the relative position between every pair of tokens
        • applied to the attention matrix itself (not possible to apply at the input stage, i.e. to the Q, K, V matrices)
        • but more computationally expensive
      • rotary embedding
        • represents position as a rotation angle, so the dot product depends only on the relative position
        • supposed to be more computationally efficient than relative embeddings
        • RetNet uses this! $e^{in\theta}$
  • D matrix
    • a lower-triangular matrix whose entries $\gamma^{n-m}$ apply an exponential decay to tokens based on their distance from the current position (see the sketch after this list)
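
A simplified sketch of how the position term $A^{n-m}$ is handled in the parallel form (my reading, not the official code): the decay $\gamma^{n-m}$ goes into the D matrix, and the rotation $e^{i(n-m)\theta}$ is folded into Q and K. A single scalar $\theta$ and a complex-valued dot product are used here for illustration; the paper uses per-dimension angles and a real-valued xPos-style implementation.

```python
import numpy as np

T, d = 6, 4
gamma, theta = 0.9, 0.3                 # toy decay factor and rotation angle
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

pos = np.arange(T)
rot = np.exp(1j * pos * theta)          # e^{in*theta}, one phase per position
Qr = Q * rot[:, None]                   # rotate queries by +n*theta
Kr = K * np.conj(rot)[:, None]          # rotate keys by -m*theta

# the phases combine into e^{i(n-m)*theta}: only the *relative* position survives
scores = np.real(Qr @ Kr.T)             # = (Q_n . K_m) * cos((n - m) * theta)

# D supplies both the gamma^{n-m} discount and the causal mask (zeros above the diagonal)
D = np.where(pos[:, None] >= pos[None, :], gamma ** (pos[:, None] - pos[None, :]), 0.0)

retention_out = (scores * D) @ V        # parallel retention output, shape (T, d)
```

In the full multi-scale retention layer, each head gets its own $\gamma$ and the head outputs are normalized with GroupNorm.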

purpose of the position-aware context

  • acts as a discount factor on older tokens, tells the model where the current token sits relative to the others, and provides causal masking (see the worked example below)
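
Concretely, for a length-4 sequence the D matrix is:

$$
D =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
\gamma & 1 & 0 & 0 \\
\gamma^2 & \gamma & 1 & 0 \\
\gamma^3 & \gamma^2 & \gamma & 1
\end{pmatrix},
\qquad
D_{nm} =
\begin{cases}
\gamma^{\,n-m}, & n \ge m \\
0, & n < m
\end{cases}
$$

The zeros above the diagonal are the causal mask, and the increasing powers of $\gamma$ down each column are the discount on older tokens.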

how does it compare against distilled / compressed versions of transformers?