Layer-wise inferencing + batching: Small VRAM doesn’t limit LLM throughput anymore

Highlights

batching and layer-wise inferencing from disk, where we stream an LLM’s layers from the hard drive into VRAM, and run each layer against multiple in-progress prompts before moving on to the next layer. ⤴️
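To make the mechanics concrete, here is a minimal sketch of layer-wise inference with batching, assuming the model's transformer layers have been exported to disk as one checkpoint file per layer. The layer class, file layout, and dimensions below are illustrative assumptions, not AirLLM's actual implementation.

```python
import torch
from torch import nn

NUM_LAYERS = 32                      # assumed layer count (e.g. a 7B-class model)
D_MODEL, N_HEADS = 4096, 32          # assumed model dimensions
LAYER_DIR = "layers"                 # hypothetical directory of per-layer checkpoints
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def load_layer(i: int) -> nn.Module:
    """Stream one layer's weights from disk into VRAM."""
    # Stand-in block; a real LLM would use its own attention/MLP layer module,
    # and the checkpoint files would be that model's per-layer weight shards.
    layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
    state = torch.load(f"{LAYER_DIR}/layer_{i}.pt", map_location=DEVICE)
    layer.load_state_dict(state)
    return layer.to(DEVICE)

@torch.no_grad()
def forward_batched(hidden: torch.Tensor) -> torch.Tensor:
    """Push a whole batch of prompts through the model one layer at a time.

    hidden: (batch, seq_len, d_model) activations for many in-progress prompts.
    Only one layer occupies VRAM at any moment; the batch dimension is what
    amortizes the cost of streaming that layer off the disk.
    """
    for i in range(NUM_LAYERS):
        layer = load_layer(i)        # disk -> VRAM: the slow part
        hidden = layer(hidden)       # applied to every prompt in the batch
        del layer                    # free VRAM before the next layer arrives
        if DEVICE == "cuda":
            torch.cuda.empty_cache()
    return hidden
```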

Layer-wise inferencing makes it possible to run large models, but not very desirable to do so.

If a “token” is a word, then the average human can speak 100-150 tokens per minute, and ChatGPT speaks about 6000 tokens per minute.

AirLLM gives us… 2 tokens per minute. Very slow indeed. ⤴️
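Taking that figure at face value, a back-of-the-envelope calculation shows why batching is the other half of the story: streaming every layer off disk costs roughly the same whether the layer is applied to one prompt or many, so aggregate throughput scales almost linearly with batch size. The batch size below is an assumed value for illustration.

```python
single_prompt_rate = 2   # tokens/minute from layer-wise inference alone (the AirLLM figure above)
batch_size = 32          # assumption: in-progress prompts sharing each layer pass

# Disk streaming dominates each layer pass and is paid once per pass, not once
# per prompt, so throughput across the batch grows roughly with the batch size:
aggregate_rate = single_prompt_rate * batch_size
print(aggregate_rate)    # ~64 tokens/minute across the whole batch
```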