Analysing LLM Inference Costs


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Prefill stage vs Decode stage

  • Prefill is when the initial prompt is fed through the model in a single pass → tends to be compute bound
    • determines “time to first token”
    • compute bound: the time taken to compute the FLOPs (limited by the GPU’s FLOPS capability) exceeds the time taken to transfer data (weights) from memory
  • Decode stage is autoregressive next-token generation, one token per forward pass → tends to be memory bound
    • memory bound: the time taken to transfer data (weights, KV cache) from memory exceeds the time taken to compute the FLOPs, since far fewer FLOPs are needed per pass at this stage (FLOPs = operation count, FLOPS = operations per second)
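The compute-bound vs memory-bound distinction can be sanity-checked with a back-of-envelope calculation. Below is a minimal sketch: it assumes roughly A100-class hardware numbers (~312 TFLOPS BF16, ~2 TB/s HBM bandwidth), approximates forward-pass FLOPs as 2 × params × tokens, and ignores the KV cache and attention FLOPs for simplicity — the model size and token counts are illustrative, not from this note.

```python
# Back-of-envelope roofline check: compute time vs weight-transfer time.
# Hardware numbers are illustrative (roughly an A100: 312 TFLOPS BF16, ~2 TB/s HBM).
PEAK_FLOPS = 312e12   # FLOP/s
MEM_BW = 2.0e12       # bytes/s

def stage_times(n_params, n_tokens, bytes_per_param=2):
    """Return (compute_time, transfer_time) in seconds for one forward pass.

    Forward-pass FLOPs approximated as 2 * params * tokens; transfer time
    covers reading the weights once (KV cache and activations ignored).
    """
    compute_t = 2 * n_params * n_tokens / PEAK_FLOPS
    transfer_t = n_params * bytes_per_param / MEM_BW
    return compute_t, transfer_t

params = 7e9  # e.g. a 7B-parameter model (illustrative)

# Prefill: a 2048-token prompt in one pass -> compute time dominates.
c, m = stage_times(params, 2048)
print(f"prefill: compute {c*1e3:.1f} ms vs transfer {m*1e3:.1f} ms")

# Decode: one token per pass -> weight transfer dominates (memory bound).
c, m = stage_times(params, 1)
print(f"decode:  compute {c*1e3:.3f} ms vs transfer {m*1e3:.1f} ms")
```

With these numbers, prefill compute time (~92 ms) far exceeds transfer time (~7 ms), while in decode the ~7 ms of weight transfer dwarfs the sub-0.1 ms of compute — which is why batching decode requests amortises the same weight reads across many tokens.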

Articles