Parallelism


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

Data Parallelism

  • Use when the model still fits on a single GPU, but a single GPU can no longer get through your data (the batch is too large or training is too slow)
  • Replicate the model on every GPU, distribute a single batch of data across the GPUs, and average the gradients each replica computes before the optimizer step
  • Implemented in PyTorch with the robust DistributedDataParallel (DDP) module
  • Alternatively, Horovod can be used as a third-party option
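A minimal single-process sketch of the gradient-averaging idea behind data parallelism. The "GPUs" are just loops, and the linear model, loss, and two-worker split are illustrative assumptions, not DDP's actual API:

```python
# Data parallelism in miniature: each "GPU" gets a shard of the batch,
# computes the gradient of the mean loss on its shard, and the shard
# gradients are averaged (an all-reduce) -- for equal-sized shards this
# equals the gradient on the full batch.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error of y_pred = w * x over a shard."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5                                    # model parameter, replicated on every GPU
xs = [1.0, 2.0, 3.0, 4.0]                  # full batch of inputs
ys = [2.0, 4.0, 6.0, 8.0]                  # targets (y = 2x)

# Split the batch across 2 simulated GPUs (equal-sized shards).
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

# Each worker computes its local gradient, then the gradients are averaged.
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]
avg_grad = sum(local_grads) / len(local_grads)

# The averaged gradient matches the gradient on the full batch.
full_grad = grad_mse(w, xs, ys)
assert abs(avg_grad - full_grad) < 1e-12
```

After the (simulated) all-reduce, every replica applies the same averaged gradient, so all model copies stay identical — that invariant is what real DDP maintains.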

If you can’t fit your model on a single GPU at all, the next 3 options apply.

Sharded Data-Parallelism

  • Shard the optimizer state, gradients, and optionally the parameters themselves across the data-parallel workers, so each GPU stores only a fraction of the training state
  • e.g. ZeRO (implemented in DeepSpeed)
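A toy sketch of the core idea behind ZeRO stage 1: optimizer state is partitioned across data-parallel workers, each worker updates only its shard, and the updated shards are all-gathered. The momentum-SGD optimizer, shard layout, and two-worker setup are illustrative assumptions:

```python
# Sharded data parallelism in miniature: every worker sees the full
# gradient (after all-reduce), but each worker stores optimizer state
# (here: momentum) only for its own shard of the parameters, halving
# the optimizer memory per worker.

LR, BETA = 0.1, 0.9

params = [1.0, 2.0, 3.0, 4.0]              # full parameter vector, replicated
grads = [0.4, 0.1, -0.2, 0.3]              # gradient after all-reduce (same everywhere)

# Worker 0 owns params[0:2], worker 1 owns params[2:4].
shard_bounds = [(0, 2), (2, 4)]
momentum = [[0.0, 0.0], [0.0, 0.0]]        # per-worker state, shard-sized only

def zero_step(params, grads):
    new_params = list(params)
    for worker, (lo, hi) in enumerate(shard_bounds):
        for i in range(lo, hi):
            momentum[worker][i - lo] = BETA * momentum[worker][i - lo] + grads[i]
            new_params[i] = params[i] - LR * momentum[worker][i - lo]
    # All-gather: every worker receives the shards it does not own.
    return new_params

# Reference: unsharded momentum SGD over the full vector.
ref_momentum = [0.0] * 4
def full_step(params, grads):
    out = []
    for i in range(4):
        ref_momentum[i] = BETA * ref_momentum[i] + grads[i]
        out.append(params[i] - LR * ref_momentum[i])
    return out

updated = zero_step(params, grads)
assert updated == full_step(params, grads)
```

The parameter updates are numerically identical to unsharded training; only where the optimizer state lives changes.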

Model Parallelism

  • can put each layer (or group of layers) of your model on a different GPU, passing activations between GPUs; pipeline parallelism adds micro-batching on top to keep the GPUs busy

Tensor Parallelism

  • split individual weight matrices across multiple GPUs (e.g. by columns or rows), so each GPU computes a slice of the matrix multiplication
  • e.g. Megatron-LM
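A minimal sketch of a column-parallel linear layer, the building block of Megatron-LM style tensor parallelism. Pure-Python list-of-lists matrices stand in for GPU tensors; the sizes and two-GPU split are illustrative assumptions:

```python
# Tensor parallelism in miniature: split a weight matrix W column-wise
# across two simulated GPUs. Each GPU computes x @ W_shard, producing a
# slice of the output; concatenating the slices (an all-gather) recovers
# the full x @ W.

def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

x = [[1, 2],
     [3, 4]]                   # input activations, 2x2 (replicated on both GPUs)

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]             # weight matrix, 2x4 (too "big" for one GPU)

# Column split: GPU 0 holds columns 0-1 of W, GPU 1 holds columns 2-3.
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]

out0 = matmul(x, W0)           # computed on GPU 0
out1 = matmul(x, W1)           # computed on GPU 1

# All-gather along the column dimension to form the full output.
gathered = [r0 + r1 for r0, r1 in zip(out0, out1)]

assert gathered == matmul(x, W)
```

A row-wise split works symmetrically: each GPU produces a partial sum of the full output, combined with an all-reduce instead of an all-gather.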

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks