Mixed-Precision Training


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

  • Involves converting the weights to lower precision (FP16) for faster computation, computing the gradients in FP16, converting the gradients back to higher precision (FP32) for numerical stability, and updating the original FP32 weights with the scaled gradients (see the sketch below)
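
A minimal sketch of that loop using PyTorch's torch.cuda.amp utilities (autocast + GradScaler); model, optimizer, loss_fn, and dataloader are assumed placeholders, not part of the original note.

```python
import torch

# `model`, `optimizer`, `loss_fn`, and `dataloader` are assumed to be defined elsewhere
scaler = torch.cuda.amp.GradScaler()  # scales the loss so small FP16 gradients do not underflow

for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()

    # Forward pass: autocast runs eligible ops in FP16; the weights themselves stay in FP32
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Backward pass on the scaled loss keeps small gradient values representable during backprop
    scaler.scale(loss).backward()

    # step() unscales the gradients back to FP32 and updates the FP32 weights;
    # update() adjusts the scale factor for the next iteration
    scaler.step(optimizer)
    scaler.update()
```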

Automatic Mixed Precision Training

Float types

  • Typically use 32-bit floats (the PyTorch default is torch.float32)
    • Single-precision floating point
    • 64-bit double-precision floating point is not used in DL: too compute-expensive and not GPU-optimised
  • The fraction component is normally called the significand or mantissa
    • Related to, but not equivalent to, the digits after the decimal point
  • Using float16 can lead to numeric overflow / underflow (illustrated in the sketch after this list)
    • Overflow: exceeding the maximum possible float16 value (65,504) results in Inf
    • Underflow: positive values below ~5.9604645e-08 (the smallest float16 subnormal) end up as 0
  • Bfloat16 extends the dynamic range compared to the conventional float16 format at the expense of decreased precision
    • Easier to represent very large and very small numbers than float16
    • Originally developed for Google TPUs
    • Supported by many NVIDIA GPUs; check with torch.cuda.is_bf16_supported()
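
A small sketch illustrating these limits with torch.finfo and a few example conversions; the specific values (70000, 1e-8, 1.001) are just illustrative, and the final check assumes a CUDA machine.

```python
import torch

print(torch.get_default_dtype())      # torch.float32 -- the PyTorch default

# torch.finfo reports the limits of each floating-point type
print(torch.finfo(torch.float16))     # max = 65504
print(torch.finfo(torch.bfloat16))    # max ~ 3.39e38, roughly the same range as float32

# Overflow: exceeding float16's maximum value gives inf
print(torch.tensor(70000.0, dtype=torch.float16))   # tensor(inf, dtype=torch.float16)
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # still representable in bfloat16

# Underflow: values well below float16's smallest subnormal (~5.96e-08) become 0
print(torch.tensor(1e-8, dtype=torch.float16))      # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-8, dtype=torch.bfloat16))     # still representable in bfloat16

# The trade-off: bfloat16 has only a 7-bit mantissa vs float16's 10 bits
print(torch.tensor(1.001, dtype=torch.float16))     # ~1.0010
print(torch.tensor(1.001, dtype=torch.bfloat16))    # 1.0 -- the extra precision is lost

# Check hardware support before switching to bfloat16
if torch.cuda.is_available():
    print(torch.cuda.is_bf16_supported())
```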

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks