Mixture of Experts (MoE)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

  • Introduced as a full network architecture / ensembling method in 1991
  • Used as a type of layer / component inside a deep network in 2013
  • Advances in dynamically activating / deactivating components based on the input led to MoEs being used in NLP in 2017, as a component inside LSTMs

In context of Transformers

  • e.g. the Switch Transformer has 2 main parts (a minimal code sketch follows this list):
  1. Sparse MoE layers in place of the feed-forward layers (fully connected / MLP / linear layers) in the transformer arch
    • Each expert is a NN (typically a fully connected network, but it can be more complex, or even an MoE itself → hierarchical MoE)
  2. Gate network / router
    • Determines which token goes to which expert(s)
    • e.g. in the Switch Transformer figure, the “More” token is sent to FFN 2 (expert 2) and the “Parameters” token to FFN 1 (expert 1)
    • A token can also go to more than 1 expert; this is one of the big design decisions when working w/ MoE
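
A minimal PyTorch sketch of the two parts above (class name and hyperparameters are illustrative, not from any specific model): each expert is a small FFN, and a learned router picks the top-k experts per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Drop-in replacement for the dense FFN in a transformer block (sketch)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # 1. Experts: each one is an ordinary feed-forward network
        #    (could be more complex, or even another MoE -> hierarchical MoE).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # 2. Gate network / router: scores each token against each expert.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        logits = self.router(x)                         # (n_tokens, n_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # each token can use up to top_k experts
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage with random data:
layer = SparseMoELayer()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)   # torch.Size([10, 512])
```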

Benefits

  • Efficient pretraining
  • Faster inference vs dense models w/ the same number of parameters
    • Since only a few experts (a fraction of the parameters) are active per token; see the rough arithmetic below
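
Back-of-the-envelope arithmetic for why compute per token stays low even though the total parameter count is large (all numbers below are illustrative, not from any specific model):

```python
# Illustrative numbers only: a top-2 MoE FFN compared against its own total size.
d_model, d_ff, n_experts, top_k = 4096, 14336, 8, 2

ffn_params    = 2 * d_model * d_ff        # up-projection + down-projection weights
total_params  = n_experts * ffn_params    # what must sit in memory
active_params = top_k * ffn_params        # what is actually used per token

print(f"total MoE params per layer : {total_params/1e9:.2f} B")
print(f"active params per token    : {active_params/1e9:.2f} B "
      f"({active_params/total_params:.0%} of total)")
# Compute per token scales with the active params (top_k / n_experts of the total),
# which is why inference is faster than a dense model of the same total size.
```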

Limitations

  • Training
    • MoEs have historically struggled to generalise during fine-tuning, which leads to overfitting
  • Inference
    • An MoE may have many params, but only some of them (the active experts) are used per token at inference
    • However, all the params must still be loaded into RAM (even the unused ones)
    • High RAM requirements!
  • The gating function may leave specific experts undertrained if it keeps routing most tokens to a few favoured experts; a common mitigation is sketched below
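
A common mitigation for this routing imbalance is an auxiliary load-balancing loss that pushes the router to spread tokens evenly across experts. The sketch below follows the Switch-Transformer-style formulation; treat the exact form and names as assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts):
    """Auxiliary load-balancing loss (Switch-Transformer-style sketch).

    router_logits: (n_tokens, n_experts) raw router scores
    expert_idx:    (n_tokens,) index of the expert each token was dispatched to
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens actually dispatched to expert i
    dispatch_frac = F.one_hot(expert_idx, n_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    mean_prob = probs.mean(dim=0)
    # Minimised when both distributions are uniform (1 / n_experts each).
    return n_experts * torch.sum(dispatch_frac * mean_prob)
```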

How does it work? Why does it work?

  • Understanding of how MoEs can perform so well is still rather unclear
  • Each expert is initialized and trained in the same manner, and the gating network is typically configured to dispatch data equally across experts, so:
    • How does each expert become “specialized” in its own task, and why don't the experts collapse into a single model?
  • Attempts to interpret the hows - [2208.02813] Towards Understanding Mixture of Experts in Deep Learning
    • Finds that the cluster structure of the underlying problem and the non-linearity of the experts are pivotal to the success of MoE

Variants

MoE with Expert Choice Routing

  • Mixture-of-Experts with Expert Choice Routing – Google Research Blog
  • Based on a different approach to matching experts and tokens within a Mixture-of-Experts (MoE) model
  • Instead of assigning tokens to experts as traditional MoE models do, Expert Choice (EC) reverses this process, assigning experts to tokens based on their importance or difficulty (sketched below)
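
A minimal sketch of the routing step under Expert Choice (variable names and the capacity rule are illustrative assumptions): each expert selects its highest-affinity tokens, so every expert processes a fixed number of tokens and none is over- or under-subscribed.

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(x, w_gate, capacity_factor=2.0):
    """x: (n_tokens, d_model), w_gate: (d_model, n_experts). Sketch only."""
    n_tokens, n_experts = x.shape[0], w_gate.shape[1]
    # Each expert processes a fixed number of tokens (its "capacity").
    capacity = int(capacity_factor * n_tokens / n_experts)
    scores = F.softmax(x @ w_gate, dim=-1)             # token-to-expert affinities
    # Reverse of token-choice routing: each expert (column) picks its top tokens.
    gates, token_idx = scores.topk(capacity, dim=0)    # both: (capacity, n_experts)
    return gates, token_idx   # token_idx[c, e] = c-th token chosen by expert e

# Usage with random data:
x = torch.randn(16, 32)
w = torch.randn(32, 4)
gates, token_idx = expert_choice_routing(x, w)
print(gates.shape, token_idx.shape)   # torch.Size([8, 4]) torch.Size([8, 4])
```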

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks