LLM Inference Economics from First Principles
Highlights
The primary cost behind a generated token boils down to the cost of compute - you need to buy or rent a GPU. ⤴️
decoder-only transformer model, like Llama ⤴️
should be paying most attention to two metrics:
- Compute: measured in FLOPS - how many floating-point operations (addition and multiplication) a GPU can do in a second*.
- Memory bandwidth: how many bytes can be loaded from the global memory in a second. ⤴️
FLOPS and FLOPs mean different things: FLOPs (lowercase s) is simply the plural of floating-point operations, with no notion of time, while FLOPS (capital S) means floating-point operations performed per second. ⤴️
Arithmetic intensity is a concept that describes the ratio of computational operations, such as addition or multiplication (measured in FLOPs), to memory accesses (measured in bytes). ⤴️
higher arithmetic intensity indicates that the program performs more computations per unit of data fetched from memory, which typically leads to better utilization of the processor’s computational capabilities and reduced bottlenecking on memory bandwidth. ⤴️
LLM inference has both a heavily compute-bound phase (very high arithmetic intensity) and a heavily memory-bound phase (very low arithmetic intensity). The majority of the wall-clock time is spent in the memory-bound phase, so the goal of efficient LLM inference is to maximize the utilization of the GPUs’ compute capacity during the memory-bound phase. Increasing the arithmetic intensity of the memory-bound phase is therefore a fundamental optimization target that directly translates to improved inference economics. ⤴️
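A minimal sketch of the roofline "ridge point" that separates the two regimes. The peak figures are the ones quoted later in these highlights for an H100-class GPU (≈989 TFLOPS, ≈3350 GB/s); they are theoretical peaks, not sustained values.

```python
# Roofline ridge point: the arithmetic intensity at which a kernel stops
# being memory-bound and becomes compute-bound (illustrative peak numbers).

PEAK_COMPUTE_FLOPS = 989e12      # floating-point operations per second
PEAK_BANDWIDTH_BPS = 3350e9      # bytes streamed from HBM per second

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte fetched from global memory."""
    return flops / bytes_moved

ridge_point = PEAK_COMPUTE_FLOPS / PEAK_BANDWIDTH_BPS
print(f"ridge point ≈ {ridge_point:.0f} FLOPs/byte")   # ≈ 295 FLOPs/byte
# Kernels below this intensity are limited by bandwidth, above it by compute.
```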
load the entire model with all parameters from global memory (HBM) (we utilize the memory bandwidth) and calculate the intermediate activations (we use the compute). ⤴️
two phases of LLM inference:
- prompt processing, or the so-called pre-fill phase
- token-by-token generation, or the so-called decoding phase
The end-to-end latency of an LLM request depends critically on the efficiency of both of these phases. ⤴️
For example, with H100 cards (see TFLOPS in Fig. 3), it would theoretically take roughly 291/989 = 0.29 s to process a prompt of 2048 tokens. As a reminder, to load the model from global memory, we need to load 141 GB worth of parameters. The memory bandwidth of a modern GPU is around 3350 GB/s, meaning that in theory it will take 141/3350 = 0.04 s to load the entire model from global memory - roughly 7x faster than the time needed for all of the computations. This demonstrates that in the pre-fill phase we are much more bound by the available compute than by the memory bandwidth. ⤴️
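The same back-of-the-envelope arithmetic as a sketch, reusing the numbers quoted above (291 TFLOPs of prefill work, 141 GB of weights) and assuming 100% utilization of both compute and bandwidth:

```python
# Prefill estimate for a 2048-token prompt, using the figures from the text.

PREFILL_FLOPS  = 291e12    # total FLOPs to process the prompt
MODEL_BYTES    = 141e9     # model weights in bytes
PEAK_COMPUTE   = 989e12    # FLOPS
PEAK_BANDWIDTH = 3350e9    # bytes/s

compute_time = PREFILL_FLOPS / PEAK_COMPUTE      # ≈ 0.29 s
load_time    = MODEL_BYTES / PEAK_BANDWIDTH      # ≈ 0.04 s

print(f"compute-limited time: {compute_time:.2f} s")
print(f"memory-limited time:  {load_time:.2f} s")
# Compute takes ~7x longer than streaming the weights -> prefill is compute-bound.
```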
eliminate doing large parts of that computation over and over again by introducing a special cache. This cache is called KV cache because it stores the key and value matrices for each token position. ⤴️
at step S+1, we’ve already calculated the attention between all of the first S tokens during the pre-fill phase. We can store these intermediate values in memory (the “cache”) and only calculate new attention values involving the most recently generated token. ⤴️
key efficiency gain comes from:
1. Reusing K^T_cache and V_cache from previous calculations
2. Only computing new key-value projections for the latest token
3. Reducing the attention calculation from O(S^2) to O(S) for each new token ⤴️
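A simplified, single-head NumPy sketch of these three points. The names (`k_cache`, `v_cache`, etc.) are illustrative, not from any particular library, and masking and multi-head details are omitted:

```python
import numpy as np

def decode_step_with_kv_cache(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One decoding step for a single attention head (illustrative only).

    x_new:   (1, d_model)  hidden state of the most recent token
    k_cache: (S, d_head)   keys for the S tokens already processed
    v_cache: (S, d_head)   values for the S tokens already processed
    """
    # Only the newest token is projected -- O(1) new K/V work per step.
    q_new = x_new @ W_q                       # (1, d_head)
    k_new = x_new @ W_k                       # (1, d_head)
    v_new = x_new @ W_v                       # (1, d_head)

    # Grow the cache; older K/V entries are reused, never recomputed.
    k_cache = np.concatenate([k_cache, k_new], axis=0)   # (S+1, d_head)
    v_cache = np.concatenate([v_cache, v_new], axis=0)   # (S+1, d_head)

    # Attention for the new token costs O(S), not O(S^2).
    scores = (q_new @ k_cache.T) / np.sqrt(k_cache.shape[-1])   # (1, S+1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                   # (1, d_head)
    return out, k_cache, v_cache
```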
approximately 2048 times faster than the pre-fill phase in terms of pure computation, but remember that we still need to load the entire model parameters (141 GB) from the global memory, and that now we need to also load the KV cache. ⤴️
The size of the KV cache can be calculated as ⤴️
kv cache = 2 x bytes_per_param x num_hidden_layers x head_size x num_key_value_heads x sequence_length
which at a sequence length of 2048 amounts to: ⤴️
kv cache = 2 x 2 x 80 x 128 x 8 x 2048 bytes ≈ 671 MB
number scales linearly with the batch size (we elaborate on batches later) and with the sequence length (see Fig. 9). ⤴️
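The formula above as a small helper. The defaults mirror the values plugged in above (80 layers, 8 KV heads, head size 128, 2 bytes per parameter); they describe the 70B-class model used in the article and would need adjusting for other models.

```python
def kv_cache_bytes(sequence_length: int,
                   batch_size: int = 1,
                   num_hidden_layers: int = 80,
                   num_key_value_heads: int = 8,
                   head_size: int = 128,
                   bytes_per_param: int = 2) -> int:
    """KV-cache size in bytes: 2 (K and V) x bytes x layers x head_size x kv_heads x seq x batch."""
    return (2 * bytes_per_param * num_hidden_layers
            * head_size * num_key_value_heads
            * sequence_length * batch_size)

print(kv_cache_bytes(2048) / 1e6)                   # ≈ 671 MB for one sequence
print(kv_cache_bytes(8192, batch_size=32) / 1e9)    # ≈ 86 GB: linear in both seq len and batch
```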
Model weights, plus KV cache, are roughly 141 + 0.6 ≈ 142 GB, so it takes 142/3350 = 0.04 s to load them from the global memory. We calculated above that it only takes 0.00014 s to do all computations (assuming 100% compute utilization) - so it takes two orders of magnitude more time to load the model weights than to do the actual computations. This is what we mean by the token-by-token phase of using LLMs being memory bound. We are primarily limited by the time memory transfer takes, not by the speed of compute. ⤴️
token-by-token phase is memory bound; the amount of available compute throughput is of secondary importance, as we massively underutilize the compute resources anyway while waiting for weights to be loaded. ⤴️
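The same per-token comparison as a sketch, reusing the figures from the text (142 GB to stream per decode step, 0.00014 s of pure compute) and still assuming 100% utilization:

```python
# Per-token estimate for the token-by-token (decode) phase at batch size 1.

BYTES_TO_LOAD  = 142e9     # model weights + KV cache streamed per decode step
PEAK_BANDWIDTH = 3350e9    # bytes/s
COMPUTE_TIME   = 0.00014   # s of pure compute per token (from the text)

memory_time = BYTES_TO_LOAD / PEAK_BANDWIDTH          # ≈ 0.042 s
step_time   = max(memory_time, COMPUTE_TIME)          # memory wins by ~300x

print(f"{memory_time*1e3:.1f} ms to stream weights + KV cache per token")
print(f"upper bound ≈ {1/step_time:.0f} tokens/s at batch size 1")
```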
main challenges in LLM serving is understanding how input prompt length affects end-to-end performance. Sequence length impacts both the prefill stage and the token-by-token decoding phase, though in fundamentally different ways. ⤴️
for small batches and short sequences, the sequence length has minimal impact on throughput because loading model weights dominates the memory bandwidth utilization. However, as either batch size or sequence length increases, loading the KV cache takes more and more time for each token, eventually surpassing the time consumed by loading the model weights themselves. ⤴️
In naive implementations of the attention mechanism, memory would also scale quadratically because we materialize the S x S score matrix, but flash attention replaces these naive implementations. Flash attention calculates (Q @ Kᵗ) @ V in an iterative way, keeping the required memory at O(N). ⤴️
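A NumPy sketch of the idea only: computing attention block-by-block with an online softmax so the full S x S score matrix is never materialized. This is the math behind flash attention, not the fused GPU kernel, and it omits causal masking.

```python
import numpy as np

def streaming_attention(Q, K, V, block_size=128):
    """softmax(Q @ K.T / sqrt(d)) @ V computed in tiles over K/V.

    Memory stays O(S * d); only one (S, block_size) score tile exists at a time.
    """
    S, d = Q.shape
    out = np.zeros((S, d))
    row_max = np.full(S, -np.inf)        # running max of scores per query row
    row_sum = np.zeros(S)                # running softmax denominator

    for start in range(0, S, block_size):
        Kb = K[start:start + block_size]              # (B, d)
        Vb = V[start:start + block_size]              # (B, d)
        scores = Q @ Kb.T / np.sqrt(d)                # (S, B) -- one tile only

        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)             # rescale previous partials
        p = np.exp(scores - new_max[:, None])

        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        row_max = new_max

    return out / row_sum[:, None]
```

The result matches the naive implementation that first builds the full score matrix; the only difference is that intermediate state per query row is just a running max, a running sum, and a partial output.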
two main ways to run large-scale AI models on multiple GPUs:
- Pipeline parallel
- Tensor parallel ⤴️
pipeline parallel (PP) setting, we split the model along the layers axis, meaning that each GPU will have a fraction of all layers ⤴️
split vertically: each GPU holds a subset of the layers
upside of such an approach is the very limited communication between the devices. We only need to broadcast activations from one device to the next 3 times for a single forward pass. ⤴️
comes at a price - at a single point in time, we only have access to 1/4 of the available compute and memory bandwidth per batch. So generation time is significantly slower than if we were to use all 4 GPUs at the same time. ⤴️
mainstream parallelism strategy, used by a vast majority of LLM inference providers, is tensor parallelism (TP). ⤴️
individual neural network layers are split across multiple GPUs, harnessing the combined compute power and memory bandwidth of all devices. ⤴️
split horizontally: each layer is sliced across all GPUs
execution must synchronize across GPUs, introducing a significant delay (on the order of milliseconds) per synchronization event. This overhead varies significantly based on interconnect technology ⤴️
TP requires all GPUs to process the same batch simultaneously. A new batch cannot begin until the current one completes, reducing throughput efficiency under dynamic workloads. ⤴️
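A rough comparison of how much time a single decode step spends streaming weights under the two strategies, for the article's 4-GPU, 141 GB example. The per-token communication overhead is a single lumped placeholder, purely illustrative; real overhead depends on the interconnect and the number of per-layer syncs.

```python
# Per-token weight-streaming time, assuming 100% bandwidth utilization.

MODEL_BYTES   = 141e9
BANDWIDTH     = 3350e9          # per-GPU HBM bandwidth, bytes/s
N_GPUS        = 4
COMM_OVERHEAD = 2e-3            # s per token lost to cross-GPU sync (illustrative)

# Pipeline parallel: shards are processed one after another, so for a single
# batch only one GPU's bandwidth is in use at any moment.
pp_time = MODEL_BYTES / BANDWIDTH

# Tensor parallel: all GPUs stream their slice of every layer concurrently,
# at the price of per-layer synchronization.
tp_time = MODEL_BYTES / (N_GPUS * BANDWIDTH) + COMM_OVERHEAD

print(f"PP ≈ {pp_time*1e3:.1f} ms per token, TP ≈ {tp_time*1e3:.1f} ms per token")
```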
main limitation in terms of tokens/s we can get is how fast our GPU is able to load the model weights from the global memory. To generate every single token, we always need to load the entire model. ⤴️
Batches are a natural improvement because we can load the model weights once and then use the loaded weights to do inference on multiple items from our batch at the same time. ⤴️
As we grow the batch size, we can effectively share the time to load the model from high-bandwidth memory (HBM); the cost of loading the model is split across an increasing number of clients, enjoying economies of scale and decreasing the per-request cost. ⤴️
As we approach really long sequences or really big batches, as we will see in our experiments, the memory footprint of the KV cache starts to slowly overtake the memory footprint of the model itself (see Fig. 15). When this happens, loading the model weights accounts for an increasingly small share of the total time spent loading data from global memory. ⤴️
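A sketch of this amortization effect, reusing the earlier figures (141 GB of weights, ~0.67 GB of KV cache per 2048-token sequence) and assuming the purely memory-bound regime with no compute or communication costs:

```python
# Batch size amortizes the weight load while the KV cache load grows linearly.

MODEL_BYTES = 141e9
KV_PER_SEQ  = 0.67e9          # bytes of KV cache per sequence at 2048 tokens
BANDWIDTH   = 3350e9          # bytes/s

for batch in (1, 8, 32, 128, 512):
    step_time  = (MODEL_BYTES + batch * KV_PER_SEQ) / BANDWIDTH   # s per decode step
    throughput = batch / step_time                                # tokens/s across the batch
    print(f"batch {batch:4d}: {throughput:8.0f} tok/s, "
          f"{step_time / batch * 1e3:6.2f} ms per token per request")
```

Throughput keeps rising with batch size, but the gains flatten once the batch's KV cache is comparable in size to the model weights.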
The time to produce a response to a prompt is a combination of the pre-fill time and the per-token decode time multiplied by the number of decode tokens. The more output tokens we produce, the smaller the share of time spent in the prefill phase. The prefill time is primarily compute-bound, and the token-by-token time is primarily memory-bound. ⤴️
prefill is so heavily compute-bound, we can estimate the time by dividing the number of floating-point operations during prefill by the total effective FLOPS across all of the GPUs, plus the extra latency from cross-GPU communication time. ⤴️
decode time is mainly memory bound, but as we increase the batch size, the compute component becomes increasingly important. We will calculate both and take the more significant factor. We also spend a small amount of time in cross-GPU communication. ⤴️
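Pulling these pieces together into a rough end-to-end estimator. The function name and all constants below are illustrative; real deployments reach only a fraction of peak compute and bandwidth, and communication costs vary with the interconnect.

```python
def estimate_request_time(prompt_flops, decode_flops_per_token,
                          bytes_per_decode_step, output_tokens,
                          n_gpus=4, peak_flops=989e12, bandwidth=3350e9,
                          comm_per_step=1e-3):
    """Rough request latency = prefill time + output_tokens * per-token decode time."""
    # Prefill: compute-bound, so divide FLOPs by the aggregate compute.
    prefill_time = prompt_flops / (n_gpus * peak_flops) + comm_per_step

    # Decode: take the larger of the memory-bound and compute-bound estimates.
    mem_time     = bytes_per_decode_step / (n_gpus * bandwidth)
    compute_time = decode_flops_per_token / (n_gpus * peak_flops)
    per_token    = max(mem_time, compute_time) + comm_per_step

    return prefill_time + output_tokens * per_token

# e.g. a 2048-token prompt, 512 output tokens, 142 GB streamed per decode step
print(estimate_request_time(prompt_flops=291e12,
                            decode_flops_per_token=0.14e12,
                            bytes_per_decode_step=142e9,
                            output_tokens=512))
```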
The key to good tokenomics is running the models at a large batch size. This follows from LLM inference being memory-bound: we want to share the cost of loading the model into memory across as many requests as possible. As the KV cache size approaches the model size, the total throughput gains diminish. ⤴️
In our model we assumed 100% memory and compute utilization. Theoretical peak performance metrics like TFLOPS and memory bandwidth are never fully achieved in practice. ⤴️
GPUs typically achieve only 50-70% of their theoretical compute capacity due to:
- Kernel launch overhead and synchronization costs
- Memory access patterns that don’t perfectly align with cache architectures
- Warp divergence and other GPU-specific inefficiencies
- Instruction mix that isn’t optimal for utilizing all GPU units
- … and a lot of other factors ⤴️
In practice, we have tons of other factors that contribute to the potential extra latency, such as
- Having to sync the CUDA graph across devices for cross-GPU communication
- Synchronization barriers that force all GPUs to wait for the slowest one
- Internal buffers and management overhead
- Potentially suboptimal (non-coalesced) memory access patterns, especially at larger sizes, with KV caches being stored in random pages of VRAM
- And other factors we don’t mention here ⤴️
Another thing we hope you take away from this text is how challenging it is to correctly predict a model’s throughput. Accurately estimating it would require a series of very detailed profiling steps: profiling memory access patterns across multiple model sizes, batch sizes, and prompt-length combinations; delving deeply into the exact implementation details of the (paged) attention implementation and of batch scheduling; estimating how different shapes affect compute and memory utilization; and dozens of different experiments. ⤴️
When evaluating hardware for LLM inference, readers should understand that memory size is not the only important factor - memory bandwidth is equally, if not more, critical. Since token-by-token generation is primarily memory-bound, they should always ask, “What is the memory speed?” as this determines how fast models can run. ⤴️