Layer-wise inferencing + batching: Small VRAM doesn’t limit LLM throughput anymore
Highlights
batching and layer-wise inferencing from disk, where we stream an LLM’s layers from the hard drive into VRAM, and run each layer against multiple in-progress prompts before moving on to the next layer. ⤴️
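The highlight above describes the core loop: load one layer's weights at a time, run that layer across a whole batch of prompts, then discard it before fetching the next layer. A minimal PyTorch-flavored sketch of that idea follows. This is not AirLLM's actual implementation; the `load_layer` helper, the per-layer checkpoint files under `layers/`, and the 80-layer count are assumptions for illustration.

```python
# Sketch of layer-wise inferencing with batching (hypothetical layout,
# not AirLLM's real API). Assumes each transformer layer was saved to
# disk as its own pickled nn.Module, e.g. layers/layer_000.pt.
import torch

NUM_LAYERS = 80        # assumed: a 70B-class model with 80 layers
LAYER_DIR = "layers"   # assumed: one .pt file per transformer layer


def load_layer(i: int) -> torch.nn.Module:
    """Stream a single layer's weights from disk into VRAM."""
    layer = torch.load(f"{LAYER_DIR}/layer_{i:03d}.pt", map_location="cuda")
    return layer.eval()


@torch.no_grad()
def forward_batch(batch_hidden: torch.Tensor) -> torch.Tensor:
    """Advance a *batch* of in-progress prompts through every layer.

    batch_hidden: (batch, seq, d_model) hidden states for all prompts.
    Each layer is loaded once, applied to the whole batch, then freed,
    so the slow disk read is amortised across every prompt in the batch.
    """
    for i in range(NUM_LAYERS):
        layer = load_layer(i)               # disk -> VRAM
        batch_hidden = layer(batch_hidden)  # one forward pass for all prompts
        del layer                           # free VRAM before the next layer
        torch.cuda.empty_cache()
    return batch_hidden
```

The design point is that only one layer ever resides in VRAM, so peak memory is roughly one layer's weights plus the batched activations; the larger the batch, the more the fixed disk-to-VRAM transfer cost is spread across prompts.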
Layer-wise inferencing makes it possible to run large models, but not very desirable to do so.
If a “token” is a word, then the average human can speak 100-150 tokens per minute, and ChatGPT speaks about 6000 tokens per minute.
AirLLM gives us… 2 tokens per minute. Very slow indeed. ⤴️
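To make those rates concrete, here is a small back-of-the-envelope comparison using the figures quoted above; the 500-token reply length is an illustrative assumption, not a number from the article.

```python
# Rough throughput comparison (tokens per minute, from the highlight above).
human_tpm = 125     # ~100-150 tokens/minute of speech
chatgpt_tpm = 6000  # ~6000 tokens/minute
airllm_tpm = 2      # layer-wise inferencing from disk, no batching

answer_tokens = 500  # assumed length of a medium-sized reply
for name, tpm in [("human", human_tpm), ("ChatGPT", chatgpt_tpm), ("AirLLM", airllm_tpm)]:
    print(f"{name:>8}: {answer_tokens / tpm:7.1f} minutes for {answer_tokens} tokens")
# AirLLM at 2 tokens/minute needs ~250 minutes (over 4 hours) for a 500-token reply.
```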