Parallelism
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Related fields
Introduction

- Figure from the Switch Transformers paper, which also shows parallelism over experts, as in Mixture of Experts (MoE) models
Data Parallelism
- Use when the model still fits on a single GPU, but the data (or batch size) you want to train on no longer does
- Split each batch of data across GPUs; every GPU runs a full copy of the model on its shard, and the resulting gradients are averaged (all-reduced) across GPUs
- Implemented in PyTorch via the robust DistributedDataParallel (DDP) wrapper; see the sketch below
- Alternatively, Horovod is a third-party option
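A minimal sketch of data parallelism with PyTorch DistributedDataParallel. It assumes launch via `torchrun --nproc_per_node=<num_gpus> script.py` (which sets LOCAL_RANK etc.); the model and data are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()     # placeholder model
    model = DDP(model, device_ids=[local_rank])    # wraps model; syncs gradients
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Each rank would normally see a different shard of the batch
    # (e.g. via DistributedSampler); random data stands in here.
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randn(32, 1024, device="cuda")

    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                # DDP averages gradients across GPUs here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```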
If you can’t even fit the model on a single GPU, the next three options apply
Sharded Data-Parallelism
- e.g. ZeRO (as in DeepSpeed and PyTorch FSDP), which shards optimizer states, gradients, and optionally parameters across GPUs; see the FSDP sketch below
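A sketch of ZeRO-style sharded data parallelism using PyTorch's FullyShardedDataParallel (FSDP); the model is a placeholder and a torchrun launch is assumed, as in the DDP example above.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering full parameters only for the layers currently being computed.
model = FSDP(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()
dist.destroy_process_group()
```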
Model Parallelism
- Place different layers (or groups of layers) of the model on different GPUs and pass activations between them, as in the sketch below
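A naive sketch of layer-wise model parallelism across two GPUs: the first half of the network lives on cuda:0, the second half on cuda:1, and activations are moved between devices in the forward pass (sizes are illustrative).

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))   # activations hop GPUs; without pipelining, GPUs idle in turn
        return x

model = TwoGPUModel()
out = model(torch.randn(16, 1024))
out.sum().backward()   # autograd routes gradients back across devices automatically
```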
Tensor Parallelism
- Split individual weight matrices within a layer across multiple GPUs, so each GPU computes part of every matrix multiplication (see the sketch after this list)
- e.g. Megatron-LM
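A simplified sketch of tensor parallelism in the spirit of Megatron-LM's column-parallel linear layer: each rank holds one slice of the weight matrix's output columns, computes its partial matmul, and the shards are all-gathered. Names and sizes are illustrative, and a torchrun launch is assumed.

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

d_in, d_out = 1024, 4096
assert d_out % world_size == 0
shard = d_out // world_size

# Each rank owns only its column shard of the full (d_in x d_out) weight matrix.
w_shard = torch.randn(d_in, shard, device="cuda")
x = torch.randn(8, d_in, device="cuda")          # inputs replicated on every rank

y_shard = x @ w_shard                            # partial output: (8, shard)
parts = [torch.empty_like(y_shard) for _ in range(world_size)]
dist.all_gather(parts, y_shard)                  # collect output shards from all ranks
y = torch.cat(parts, dim=-1)                     # full (8, d_out) output on every rank

dist.destroy_process_group()
```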