Layer-wise inferencing + batching: Small VRAM doesn’t limit LLM throughput anymore

Highlights

batching and layer-wise inferencing from disk, where we stream an LLM’s layers from the hard drive into VRAM, and run each layer against multiple in-progress prompts before moving on to the next layer. ⤴️
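To make the mechanics concrete, here is a minimal sketch of layer-wise inference with batching, assuming the model's transformer layers have been exported to disk as one checkpoint file per layer. The layer class, file layout, and dimensions below are illustrative assumptions, not AirLLM's actual implementation.

```python
import torch
from torch import nn

NUM_LAYERS = 32                      # assumed layer count (e.g. a 7B-class model)
D_MODEL, N_HEADS = 4096, 32          # assumed model dimensions
LAYER_DIR = "layers"                 # hypothetical directory of per-layer checkpoints
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def load_layer(i: int) -> nn.Module:
    """Stream one layer's weights from disk into VRAM."""
    # Stand-in block; a real LLM would use its own attention/MLP layer module,
    # and the checkpoint files would be that model's per-layer weight shards.
    layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
    state = torch.load(f"{LAYER_DIR}/layer_{i}.pt", map_location=DEVICE)
    layer.load_state_dict(state)
    return layer.to(DEVICE)

@torch.no_grad()
def forward_batched(hidden: torch.Tensor) -> torch.Tensor:
    """Push a whole batch of prompts through the model one layer at a time.

    hidden: (batch, seq_len, d_model) activations for many in-progress prompts.
    Only one layer occupies VRAM at any moment; the batch dimension is what
    amortizes the cost of streaming that layer off the disk.
    """
    for i in range(NUM_LAYERS):
        layer = load_layer(i)        # disk -> VRAM: the slow part
        hidden = layer(hidden)       # applied to every prompt in the batch
        del layer                    # free VRAM before the next layer arrives
        if DEVICE == "cuda":
            torch.cuda.empty_cache()
    return hidden
```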

Layer-wise inferencing makes it possible to run large models, but not very desirable to do so.

If a “token” is a word, then the average human can speak 100-150 tokens per minute, and ChatGPT speaks about 6000 tokens per minute.

AirLLM gives us… 2 tokens per minute. Very slow indeed. ⤴️
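Taking that figure at face value, a back-of-the-envelope calculation shows why batching is the other half of the story: streaming every layer off disk costs roughly the same whether the layer is applied to one prompt or many, so aggregate throughput scales almost linearly with batch size. The batch size below is an assumed value for illustration.

```python
single_prompt_rate = 2   # tokens/minute from layer-wise inference alone (the AirLLM figure above)
batch_size = 32          # assumption: in-progress prompts sharing each layer pass

# Disk streaming dominates each layer pass and is paid once per pass, not once
# per prompt, so throughput across the batch grows roughly with the batch size:
aggregate_rate = single_prompt_rate * batch_size
print(aggregate_rate)    # ~64 tokens/minute across the whole batch
```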