Mixture of Experts (MoE)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Related fields
Introduction
- Introduced as a full network arch / ensembling method in 1991
- Used as a type of layer / component in a deep network in 2013
- Advances in conditional computation (dynamically activating / deactivating components based on the input) led to MoEs being used in NLP in 2017, as a sparsely-gated layer between stacked LSTM layers
In context of Transformers

- e.g. the Switch Transformer; an MoE Transformer has 2 main parts:
- Sparse MoE layers, which replace the feed-forward / fully connected / MLP layers in the Transformer arch
- Each expert is a NN (typically a fully connected network, but can be more complex, or even an MoE itself ⇒ hierarchical MoE)
- Gate network / router
- Determines which tokens go to which experts
- e.g. in the referenced figure, the “More” token is sent to FFN 2 (expert 2) and the “Parameters” token to FFN 1 (expert 1)
- A token can also be routed to more than 1 expert ⇒ how many experts per token is one of the big design decisions when working w/ MoE (see the sketch after this list)
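A minimal sketch of such a layer in PyTorch, assuming a softmax router with top-k token-choice routing; the names (SparseMoE, n_experts, top_k) are illustrative, not from any specific library:

```python
# Minimal sparse MoE layer sketch: a router scores experts per token, each token
# is dispatched to its top-k experts, and outputs are combined with gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router / gate network: one linear layer producing a score per expert.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert replaces the dense FFN of a Transformer block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model) -- batch and sequence dims flattened for simplicity.
        gate_probs = F.softmax(self.router(x), dim=-1)            # (n_tokens, n_experts)
        topk_probs, topk_idx = gate_probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens that selected expert e among their top-k choices.
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            weight = topk_probs[token_idx, slot].unsqueeze(-1)    # gate weight per token
            out[token_idx] += weight * expert(x[token_idx])       # weighted expert output
        return out

# Usage: drop-in replacement for the FFN sub-layer of a Transformer block.
tokens = torch.randn(16, 512)                # 16 tokens, d_model = 512
moe = SparseMoE(d_model=512, d_ff=2048)
print(moe(tokens).shape)                     # torch.Size([16, 512])
```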
Benefits
- Efficient pretraining
- Faster inference vs a dense model w/ the same total number of parameters
- Since only the selected expert(s) are active per token, compute scales with active rather than total parameters (see the worked example below)
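A rough back-of-the-envelope sketch of why this holds, using illustrative numbers (top-1 routing over 64 experts): only ~1/64 of the expert parameters are active for any given token, so per-token compute stays close to that of a single dense FFN:

```python
# Illustrative parameter count: total vs active expert parameters per token.
d_model, d_ff = 512, 2048
n_experts, top_k = 64, 1                      # Switch-Transformer-style top-1 routing

ffn_params = 2 * d_model * d_ff               # one expert FFN (two weight matrices, biases ignored)
total_expert_params = n_experts * ffn_params  # all experts must be stored / loaded
active_expert_params = top_k * ffn_params     # but only this much is used per token

print(f"total expert params : {total_expert_params:,}")   # 134,217,728
print(f"active per token    : {active_expert_params:,}")  # 2,097,152 (~1/64 of the total)
```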
Limitations
- Training
- MoEs struggle to generalise during fine-tuning ⇒ prone to overfitting
- Inference
- An MoE may have many params, but only some of them are used at inference (the routed experts)
- But, all the params must be loaded into RAM (even the unused ones)
- High RAM requirements!
- The gating network may collapse onto a few favoured experts, leaving the other experts undertrained (see the load-balancing loss sketch below)
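A sketch of the auxiliary load-balancing loss commonly used against this (Switch-Transformer-style: loss = α · N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability for expert i); the function name and the α value below are illustrative:

```python
# Auxiliary load-balancing loss sketch (top-1 routing assumed for simplicity).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    # router_logits: (n_tokens, n_experts)
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                  # (n_tokens, n_experts)
    dispatch = F.one_hot(probs.argmax(dim=-1), n_experts)     # which expert each token is sent to
    f = dispatch.float().mean(dim=0)                          # fraction of tokens per expert
    p = probs.mean(dim=0)                                     # mean router probability per expert
    return alpha * n_experts * torch.sum(f * p)

# The loss is minimised when tokens and router probability mass are spread uniformly
# across experts, counteracting the "rich get richer" collapse onto a few experts.
```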
How does it work? Why does it work?
- Why MoEs can perform so well is not fully understood
- Each expert is initialized and trained in the same manner, and the gating network is typically configured to dispatch data equally to each expert
- So how does each expert become “specialized” in its own task, and why do the experts not collapse into a single model?
- Attempts to interpret this: [2208.02813] Towards Understanding Mixture of Experts in Deep Learning
- Finds that the cluster structure of the underlying problem and the non-linearity of the experts are pivotal to the success of MoE
Variants
MoE with Expert Choice Routing
- Mixture-of-Experts with Expert Choice Routing – Google Research Blog
- Based on a different approach to matching experts and tokens within a Mixture-of-Experts (MoE) model
- Instead of each token choosing its experts (as in traditional token-choice routing), expert choice (EC) reverses the process: each expert selects the tokens it will process based on token-expert affinity, which balances expert load and lets more important or difficult tokens receive more compute (see the sketch below)
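A minimal sketch of expert-choice routing under these assumptions (softmax over tokens per expert, fixed per-expert capacity); names like expert_choice_route and capacity_factor are illustrative:

```python
# Expert-choice routing sketch: each expert picks its own top-k tokens, instead of
# each token picking its top-k experts. Expert load is even by construction; a token
# may be picked by several experts (more compute) or by none.
import torch
import torch.nn.functional as F

def expert_choice_route(scores: torch.Tensor, capacity_factor: float = 2.0):
    # scores: (n_tokens, n_experts) raw token-expert affinities from the router.
    n_tokens, n_experts = scores.shape
    k = int(capacity_factor * n_tokens / n_experts)           # tokens each expert processes
    probs = F.softmax(scores, dim=0)                          # normalise over tokens, per expert
    gate, token_idx = probs.topk(k, dim=0)                    # each expert's chosen tokens
    return gate, token_idx                                    # both (k, n_experts)

scores = torch.randn(16, 4)                   # 16 tokens, 4 experts
gate, token_idx = expert_choice_route(scores)
print(token_idx.shape)                        # torch.Size([8, 4]) -> 8 tokens per expert
```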
Theoretical References
Papers
Articles
- Mixture of Experts Explained | Hugging Face
- Mixture of Experts: How an Ensemble of AI Models Act as One | Deepgram
Courses
Code References
Methods
- GitHub - stanford-futuredata/megablocks
- moe_lm
- GitHub - XueFuzhao/OpenMoE: A family of open-sourced Mixture-of-Experts (MoE) Large Language Models