Mixture of Experts (MoE)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

  • Introduced as a full network architecture / ensembling method in 1991
  • Used as a type of layer / component inside a deep network in 2013
  • Advances in dynamically activating / deactivating components based on the input led to MoEs being used in NLP in 2017, as a component inside LSTMs

In context of Transformers

  • e.g. the Switch Transformer has 2 main parts (a minimal code sketch follows this list):
  1. Sparse MoE layers in place of the feed-forward layers (fully connected / MLP / linear layers) in the transformer arch
    • Each expert is a NN (typically a fully connected network, but it can be more complex, or even an MoE itself → hierarchical MoE)
  2. Gate network / router
    • Determines which token goes to which expert(s)
    • e.g. in the Switch Transformer figure, the “More” token is sent to FFN 2 (expert 2) and the “Parameters” token to FFN 1 (expert 1)
    • A token can also go to more than 1 expert; this is one of the big design decisions when working w/ MoE
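
A minimal PyTorch sketch of the two parts above (class name and hyperparameters are illustrative, not from any specific model): each expert is a small FFN, and a learned router picks the top-k experts per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Drop-in replacement for the dense FFN in a transformer block (sketch)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # 1. Experts: each one is an ordinary feed-forward network
        #    (could be more complex, or even another MoE -> hierarchical MoE).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # 2. Gate network / router: scores each token against each expert.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        logits = self.router(x)                         # (n_tokens, n_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # each token can use up to top_k experts
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage with random data:
layer = SparseMoELayer()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)   # torch.Size([10, 512])
```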

Benefits

  • Efficient pretraining
  • Faster inference vs dense models w/ the same number of parameters
    • Since only a few experts (a fraction of the parameters) are active per token; see the rough arithmetic below
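
Back-of-the-envelope arithmetic for why compute per token stays low even though the total parameter count is large (all numbers below are illustrative, not from any specific model):

```python
# Illustrative numbers only: a top-2 MoE FFN compared against its own total size.
d_model, d_ff, n_experts, top_k = 4096, 14336, 8, 2

ffn_params    = 2 * d_model * d_ff        # up-projection + down-projection weights
total_params  = n_experts * ffn_params    # what must sit in memory
active_params = top_k * ffn_params        # what is actually used per token

print(f"total MoE params per layer : {total_params/1e9:.2f} B")
print(f"active params per token    : {active_params/1e9:.2f} B "
      f"({active_params/total_params:.0%} of total)")
# Compute per token scales with the active params (top_k / n_experts of the total),
# which is why inference is faster than a dense model of the same total size.
```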

Limitations

  • Training
    • MoEs have historically struggled to generalise during fine-tuning, which leads to overfitting
  • Inference
    • An MoE may have many params, but only some of them (the active experts) are used per token at inference
    • However, all the params must still be loaded into RAM (even the unused ones)
    • High RAM requirements!
  • The gating function may leave specific experts undertrained if it keeps routing most tokens to a few favoured experts; a common mitigation is sketched below
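
A common mitigation for this routing imbalance is an auxiliary load-balancing loss that pushes the router to spread tokens evenly across experts. The sketch below follows the Switch-Transformer-style formulation; treat the exact form and names as assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts):
    """Auxiliary load-balancing loss (Switch-Transformer-style sketch).

    router_logits: (n_tokens, n_experts) raw router scores
    expert_idx:    (n_tokens,) index of the expert each token was dispatched to
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens actually dispatched to expert i
    dispatch_frac = F.one_hot(expert_idx, n_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    mean_prob = probs.mean(dim=0)
    # Minimised when both distributions are uniform (1 / n_experts each).
    return n_experts * torch.sum(dispatch_frac * mean_prob)
```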

How does it work? Why does it work?

  • Understanding of how MoEs can perform so well is still rather unclear
  • Each expert is initialized and trained in the same manner, and the gating network is typically configured to dispatch data equally across experts, so:
    • How does each expert become “specialized” in its own task, and why don't the experts collapse into a single model?
  • Attempts to interpret the hows - [2208.02813] Towards Understanding Mixture of Experts in Deep Learning
    • Finds that the cluster structure of the underlying problem and the non-linearity of the experts are pivotal to the success of MoE

Variants

MoE with Expert Choice Routing

  • Mixture-of-Experts with Expert Choice Routing – Google Research Blog
  • Based on a different approach to matching experts and tokens within a Mixture-of-Experts (MoE) model
  • Instead of assigning tokens to experts as traditional MoE models do, Expert Choice (EC) reverses this process, assigning experts to tokens based on their importance or difficulty (sketched below)
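
A minimal sketch of the routing step under Expert Choice (variable names and the capacity rule are illustrative assumptions): each expert selects its highest-affinity tokens, so every expert processes a fixed number of tokens and none is over- or under-subscribed.

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(x, w_gate, capacity_factor=2.0):
    """x: (n_tokens, d_model), w_gate: (d_model, n_experts). Sketch only."""
    n_tokens, n_experts = x.shape[0], w_gate.shape[1]
    # Each expert processes a fixed number of tokens (its "capacity").
    capacity = int(capacity_factor * n_tokens / n_experts)
    scores = F.softmax(x @ w_gate, dim=-1)             # token-to-expert affinities
    # Reverse of token-choice routing: each expert (column) picks its top tokens.
    gates, token_idx = scores.topk(capacity, dim=0)    # both: (capacity, n_experts)
    return gates, token_idx   # token_idx[c, e] = c-th token chosen by expert e

# Usage with random data:
x = torch.randn(16, 32)
w = torch.randn(32, 4)
gates, token_idx = expert_choice_routing(x, w)
print(gates.shape, token_idx.shape)   # torch.Size([8, 4]) torch.Size([8, 4])
```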

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks