Parallelism
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Related fields
Introduction

- Figure from the Switch Transformers paper, which also shows parallelism over experts, as in Mixture of Experts (MoE) models
Data Parallelism
- Use when the model still fits on a single GPU, but the data (or batch size) you want to train on no longer does
- Split each batch of data across GPUs; every GPU runs a full copy of the model on its shard, and the resulting gradients are averaged (all-reduced) across GPUs
- Implemented in PyTorch via the robust DistributedDataParallel (DDP) wrapper; see the sketch below
- Alternatively, Horovod is a third-party option
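A minimal sketch of data parallelism with PyTorch DistributedDataParallel. It assumes launch via `torchrun --nproc_per_node=<num_gpus> script.py` (which sets LOCAL_RANK etc.); the model and data are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()     # placeholder model
    model = DDP(model, device_ids=[local_rank])    # wraps model; syncs gradients
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Each rank would normally see a different shard of the batch
    # (e.g. via DistributedSampler); random data stands in here.
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randn(32, 1024, device="cuda")

    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                # DDP averages gradients across GPUs here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```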
If you can’t even fit the model on a single GPU, the next three options apply
Sharded Data-Parallelism
- e.g. ZeRO (as in DeepSpeed and PyTorch FSDP), which shards optimizer states, gradients, and optionally parameters across GPUs; see the FSDP sketch below
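A sketch of ZeRO-style sharded data parallelism using PyTorch's FullyShardedDataParallel (FSDP); the model is a placeholder and a torchrun launch is assumed, as in the DDP example above.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering full parameters only for the layers currently being computed.
model = FSDP(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()
dist.destroy_process_group()
```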
Model Parallelism
- Place different layers (or groups of layers) of the model on different GPUs and pass activations between them, as in the sketch below
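A naive sketch of layer-wise model parallelism across two GPUs: the first half of the network lives on cuda:0, the second half on cuda:1, and activations are moved between devices in the forward pass (sizes are illustrative).

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))   # activations hop GPUs; without pipelining, GPUs idle in turn
        return x

model = TwoGPUModel()
out = model(torch.randn(16, 1024))
out.sum().backward()   # autograd routes gradients back across devices automatically
```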
Tensor Parallelism
- Split individual weight matrices within a layer across multiple GPUs, so each GPU computes part of every matrix multiplication (see the sketch after this list)
- e.g. Megatron-LM
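A simplified sketch of tensor parallelism in the spirit of Megatron-LM's column-parallel linear layer: each rank holds one slice of the weight matrix's output columns, computes its partial matmul, and the shards are all-gathered. Names and sizes are illustrative, and a torchrun launch is assumed.

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

d_in, d_out = 1024, 4096
assert d_out % world_size == 0
shard = d_out // world_size

# Each rank owns only its column shard of the full (d_in x d_out) weight matrix.
w_shard = torch.randn(d_in, shard, device="cuda")
x = torch.randn(8, d_in, device="cuda")          # inputs replicated on every rank

y_shard = x @ w_shard                            # partial output: (8, shard)
parts = [torch.empty_like(y_shard) for _ in range(world_size)]
dist.all_gather(parts, y_shard)                  # collect output shards from all ranks
y = torch.cat(parts, dim=-1)                     # full (8, d_out) output on every rank

dist.destroy_process_group()
```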