Combining NVIDIA DGX Spark + Apple Mac Studio for 4x Faster LLM Inference with EXO 1.0 | EXO

Highlights

Prefill is compute-bound

Prefill processes the prompt and builds a KV cache for each transformer layer. The KV cache consists of key and value vectors for each token in the prompt.

These vectors are stored during prefill so we don’t need to recompute them during decode.

For large contexts, the amount of compute grows quadratically with the prompt length (Θ(s²)) since every token needs to attend to all the other tokens in the prompt.

The data moved also grows quadratically with the prompt length (Θ(s²)) because we need to move the attention matrix.

Both are quadratic, so the ratio between the compute and the data moved, i.e. the arithmetic intensity, is constant. However, this constant is large: it is on the order of the model's hidden dimension h (e.g. Llama-3.1-8B has h = 4096).

This means for large contexts, the arithmetic intensity of prefill is very large.

This makes prefill with large contexts compute-bound.
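To make that concrete, here is a back-of-the-envelope sketch in Python. It counts only the two attention matmuls (QKᵀ and scores·V) of a single layer in fp16 and ignores the QKV/output projections, so the numbers are illustrative rather than EXO's actual cost model:

```python
def prefill_attention_intensity(s: int, h: int, bytes_per_elem: int = 2) -> float:
    """Rough FLOPs-per-byte of the attention matmuls during prefill."""
    flops = 2 * 2 * s * s * h                            # QK^T and scores @ V: ~2*s^2*h MACs each
    bytes_moved = bytes_per_elem * (3 * s * h + s * s)   # read Q, K, V; move the s x s score matrix
    return flops / bytes_moved

# Llama-3.1-8B-like hidden dimension, 8k-token prompt
print(prefill_attention_intensity(s=8192, h=4096))       # ~3300 FLOPs per byte, i.e. on the order of h
```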

Decode is memory‑bound

Decode is the auto‑regressive loop after prefill. Each step generates one token by attending against the entire KV cache built so far.

In decode, we are doing vector-matrix multiplications, which have much lower arithmetic intensity than the matrix-matrix multiplications of prefill.

This makes decode memory-bound.
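The same back-of-the-envelope estimate for a single decode step (same simplifying assumptions: one layer, fp16, attention matmuls only) shows why: the arithmetic intensity collapses to roughly one FLOP per byte, independent of how long the prompt is.

```python
def decode_attention_intensity(s: int, h: int, bytes_per_elem: int = 2) -> float:
    """Rough FLOPs-per-byte of the attention step for one new token during decode."""
    flops = 2 * 2 * s * h                       # q @ K^T and scores @ V: ~s*h MACs each
    bytes_moved = bytes_per_elem * 2 * s * h    # must read the entire K and V cache
    return flops / bytes_moved

print(decode_attention_intensity(s=8192, h=4096))   # ~1 FLOP per byte: bandwidth-limited
```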

  • Prefill → high compute device.
  • Decode → high memory-bandwidth device.

Prefill on DGX Spark, transfer KV, decode on M3 Ultra
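A minimal sketch of that split, with hypothetical prefill_device / decode_device objects standing in for the DGX Spark and the M3 Ultra (this is not EXO's actual API):

```python
def generate(prompt_tokens, prefill_device, decode_device, max_new_tokens):
    # 1. Prefill on the compute-heavy device (e.g. DGX Spark) builds the KV cache.
    kv_cache = prefill_device.prefill(prompt_tokens)

    # 2. Ship the KV cache to the bandwidth-heavy device (e.g. M3 Ultra).
    kv_cache = decode_device.receive_kv(kv_cache)

    # 3. Run the auto-regressive decode loop where memory bandwidth is highest.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = decode_device.decode_step(tokens[-1], kv_cache)
        tokens.append(next_token)
    return tokens
```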

Overlap communication with compute

The KV cache doesn’t have to arrive as one blob at the end. It can arrive layer by layer.

As soon as Layer 1’s prefill completes, two things happen simultaneously. Layer 1’s KV starts transferring to the M3 Ultra, and Layer 2’s prefill begins on the DGX Spark. The communication for each layer overlaps with the computation of subsequent layers.
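A rough sketch of that overlap, using a single background worker for transfers; layer.prefill and send_kv are hypothetical stand-ins for the real per-layer compute and the KV transport to the M3 Ultra:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_prefill(layers, hidden_states, send_kv):
    """Prefill layer by layer, sending each layer's KV while later layers compute."""
    transfers = []
    with ThreadPoolExecutor(max_workers=1) as tx:
        for i, layer in enumerate(layers):
            hidden_states, layer_kv = layer.prefill(hidden_states)  # compute layer i's prefill
            transfers.append(tx.submit(send_kv, i, layer_kv))       # start shipping layer i's KV
            # the loop immediately continues with layer i+1 while the transfer runs in the background
        for t in transfers:
            t.result()  # make sure every layer's KV has landed before decode begins
    return hidden_states
```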