Vision Language Models (Better, faster, stronger)
Tags: omnivore, VLM, transformers, good-read!, VLA
Highlights
Any-to-any models, as the name suggests, are models that can take in any modality and output any modality (image, text, audio). They do it by aligning the modalities, so that an input from one modality can be translated to another. ⤴️
They use multiple encoders (one for each modality) and then fuse the embeddings to create a shared representation space. The decoders (single or multiple) take the shared latent space as input and decode into the modality of choice. ⤴️
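As a rough illustration of this encoder-fusion-decoder idea, here is a minimal PyTorch sketch; the module names, feature sizes, and concatenation-based fusion are illustrative assumptions, not any specific model's architecture.

```python
import torch
import torch.nn as nn

class AnyToAnyModel(nn.Module):
    """Toy sketch: per-modality encoders project into a shared latent space,
    and per-modality decoders read back out of it. All dimensions are made up."""
    def __init__(self, d_shared=512):
        super().__init__()
        # One encoder per modality, each mapping its own feature size to d_shared
        self.encoders = nn.ModuleDict({
            "image": nn.Linear(768, d_shared),
            "text": nn.Linear(1024, d_shared),
            "audio": nn.Linear(256, d_shared),
        })
        # One decoder per modality, mapping the shared space back out
        self.decoders = nn.ModuleDict({
            "image": nn.Linear(d_shared, 768),
            "text": nn.Linear(d_shared, 1024),
            "audio": nn.Linear(d_shared, 256),
        })

    def forward(self, inputs: dict, target_modality: str):
        # Encode each provided modality, then fuse by concatenating along the sequence axis
        encoded = [self.encoders[m](x) for m, x in inputs.items()]
        shared = torch.cat(encoded, dim=1)  # (batch, total_tokens, d_shared)
        # Decode the fused representation into the requested modality
        return self.decoders[target_modality](shared)

model = AnyToAnyModel()
out = model({"image": torch.randn(1, 196, 768), "text": torch.randn(1, 16, 1024)}, "text")
print(out.shape)  # torch.Size([1, 212, 1024])
```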
After a certain point, the benchmarks saturated and scaling models had diminishing returns. The community turned to shrinking larger models through various methods, such as distillation. This makes sense because it reduces compute costs, simplifies deployment, and unlocks use cases like local execution, enhancing data privacy. ⤴️
SmolVLM2, for instance, attempted to solve video understanding at these sizes and found 500M to be a good trade-off. ⤴️
gemma3-4b-it by Google DeepMind is particularly exciting, as it is one of the smallest multimodal models with a 128k-token context window and support for 140+ languages. ⤴️
Qwen2.5-VL-3B-Instruct is also worth noting. The model can do various tasks ranging from localization (object detection and pointing), to document understanding, to agentic tasks, with a context length of up to 32k tokens. ⤴️
Mixture-of-Experts (MoE) models offer an alternative to dense architectures by dynamically selecting and activating only the most relevant sub-models, termed “experts”, to process a given input segment. ⤴️
This selective activation mechanism (handled by a router) has demonstrated the potential to substantially enhance model performance and operational efficiency while utilizing fewer computational resources. ⤴️
MoEs are faster at inference than dense counterparts with a similar parameter count, because only a smaller slice of the network is activated. They also converge faster during training. ⤴️
MoEs come with a higher memory cost, however, since the entire model must reside on the GPU even though only a small chunk is used for each input. ⤴️
MoE layers are most commonly integrated by replacing the standard Feed-Forward Network (FFN) layers within each Transformer block. ⤴️
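A minimal sketch of such an MoE layer, with a linear router selecting the top-k experts per token. The sizes, expert count, and explicit loops are illustrative only; real implementations use fused, load-balanced kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Toy sparse MoE layer standing in for a Transformer block's FFN.
    A router picks top_k experts per token; only those experts run."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # produces routing logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, seq, d_model)
        batch, seq, d_model = x.shape
        flat = x.reshape(-1, d_model)                        # one row per token
        weights = F.softmax(self.router(flat), dim=-1)       # (tokens, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # keep only the top_k experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = topk_idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_w[mask, slot, None] * expert(flat[mask])
        return out.reshape(batch, seq, d_model)

layer = MoEFeedForward()
print(layer(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```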
Vision language models that have mixture-of-experts decoders seem to have enhanced performance. ⤴️
VLAs take images and text instructions and directly return text indicating the actions the robot should take. VLAs extend vision language models by adding action and state tokens to interact with and control physical environments. ⤴️
The extra tokens represent the system’s internal state (how it perceives the environment), actions (what it does based on commands), and time-related information (like the order of steps in a task). These tokens are appended to the vision-language input to generate actions or a policy. ⤴️
VLAs are usually fine-tuned on top of a base VLM. ⤴️
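To make the token extension concrete, here is a toy sketch of how continuous robot actions might be discretized into reserved action-token IDs. The bin count, token-ID offset, and action ranges are hypothetical and not taken from any particular VLA.

```python
import numpy as np

NUM_BINS = 256               # each action dimension is discretized into 256 bins (assumed)
ACTION_TOKEN_OFFSET = 32000  # hypothetical start of reserved action-token IDs

def actions_to_tokens(action, low=-1.0, high=1.0):
    """Map a continuous action vector (e.g. 7-DoF arm deltas) to discrete token IDs."""
    norm = (np.clip(action, low, high) - low) / (high - low)      # -> [0, 1]
    bins = np.minimum((norm * NUM_BINS).astype(int), NUM_BINS - 1)
    return (ACTION_TOKEN_OFFSET + bins).tolist()

def tokens_to_actions(tokens, low=-1.0, high=1.0):
    """Invert: decode generated action tokens back to approximate continuous values."""
    bins = np.array(tokens) - ACTION_TOKEN_OFFSET
    return low + (bins + 0.5) / NUM_BINS * (high - low)

action = np.array([0.1, -0.5, 0.9, 0.0, 0.0, 0.0, 1.0])  # e.g. xyz delta, rotation, gripper
tokens = actions_to_tokens(action)
print(tokens)                    # IDs appended after the vision-language prompt
print(tokens_to_actions(tokens)) # roughly recovers the original action
```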
VLMs enable generalization over traditional computer vision tasks. ⤴️
PaliGemma was the first model to attempt solving these tasks. The model takes in an image and text, where text is a description of an object of interest, along with a task prefix. The text prompt looks like “segment striped cat” or “detect bird on the roof”. ⤴️
For detection, the model outputs the bounding box coordinates as tokens. ⤴️
For segmentation, on the other hand, the model outputs detection tokens and segmentation tokens. These segmentation tokens aren’t the segmented pixel coordinates themselves, but codebook indices that a variational autoencoder has been trained to decode into valid segmentation masks. ⤴️
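A hedged usage sketch of this prompting style with the transformers library; the checkpoint name and image URL are assumptions, and the exact location-token format may differ between PaliGemma releases.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Assumed checkpoint and image; any PaliGemma "mix" checkpoint should behave similarly.
model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "detect cat"  # task prefix + object description, e.g. "segment striped cat"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

# The generated text contains location tokens (e.g. <loc0256>) that encode
# bounding-box coordinates on a normalized grid.
print(processor.decode(output[0], skip_special_tokens=False))
```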
PaliGemma 2 appeared with the same capabilities and better performance. ⤴️
Molmo by Allen AI can point to instances with dots and count object instances. ⤴️
Qwen2.5-VL can also detect, point to, and count objects, and this includes UI elements as objects too! ⤴️
Multimodal retrievers take a stack of PDFs and a query as input and return the most relevant page numbers along with their confidence scores. The scores represent how likely the page contains the answer to the query, or how relevant the query is to the page. This bypasses the brittle parsing step.
The most relevant pages are then fed to the vision language model along with the query, and the VLM generates the answer. ⤴️
There are two main multimodal retriever architectures:
- Document Screenshot Embedding (DSE, MCDSE)
- ColBERT-like models (ColPali, ColQwen2, ColSmolVLM), which score pages with late-interaction matching (sketched after this list) ⤴️
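A minimal sketch of the ColBERT-style late-interaction (MaxSim) scoring these retrievers use, with randomly generated multi-vector embeddings standing in for real query and page embeddings from a model such as ColPali.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late interaction: every query token finds its best-matching page patch,
    and those similarities are summed into one relevance score.
    query_emb: (num_query_tokens, dim), page_emb: (num_page_patches, dim)."""
    sim = query_emb @ page_emb.T          # (query_tokens, page_patches) similarity matrix
    return sim.max(dim=1).values.sum()    # best patch per query token, then sum

# Hypothetical multi-vector embeddings; a real retriever would produce these
# from the query text and from screenshots of each PDF page.
query = F.normalize(torch.randn(12, 128), dim=-1)
pages = [F.normalize(torch.randn(1030, 128), dim=-1) for _ in range(5)]

scores = torch.tensor([maxsim_score(query, p) for p in pages])
top = scores.argsort(descending=True)[:3]
print("most relevant pages:", top.tolist(), "scores:", scores[top].tolist())
```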
As of April 2025, Qwen2.5-VL is a good candidate for agentic workflows, as the model is further trained on agentic tasks. ⤴️
Most vision language models these days can handle videos, because videos can be represented as a sequence of frames. However, video understanding is tricky because of the temporal relationship between frames and the large number of frames, so different techniques are used to select a representative set of video frames. ⤴️
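A simple example of one such technique, uniform frame sampling; real pipelines may also use fps-based or content-aware selection.

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int = 16):
    """Pick a representative, evenly spaced subset of frame indices."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int).tolist()

# e.g. pick 8 frames from a ~1-minute, 30 fps clip
print(sample_frame_indices(total_frames=1800, num_frames=8))
# [0, 257, 514, 771, 1028, 1285, 1542, 1799]
```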
Preference optimization is an alternative fine-tuning approach for language models that can also be extended to vision language models. Instead of relying on fixed labels, this method focuses on comparing and ranking candidate responses based on preferences. The trl library offers support for direct preference optimization (DPO), including for VLMs. ⤴️
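A hedged sketch of what DPO fine-tuning of a VLM with trl could look like; the checkpoint, dataset, and hyperparameters are placeholders, and the preference data may need reshaping into the prompt/chosen/rejected (plus images) format the trainer expects, so check the trl documentation for the exact spec.

```python
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import DPOConfig, DPOTrainer

# Assumed checkpoint; any VLM supported by trl's DPOTrainer should work similarly.
model_id = "HuggingFaceTB/SmolVLM-Instruct"
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Preference data: each row pairs a prompt (and image) with a chosen and a rejected answer.
# This dataset and split are placeholders and may need column remapping.
dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:1%]")

training_args = DPOConfig(
    output_dir="smolvlm-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,  # the processor handles both text and images
)
trainer.train()
```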
Our model picks
Here are our picks for some highlighted models. There are many models we like; the ones below are among the latest.
| Model Name | Sizes | Why we love it |
|---|---|---|
| Qwen2.5-VL | 3B to 72B | Great versatile model with agentic capabilities, math and more |
| RolmOCR | 7B | Very performant OCR model |
| Kimi-VL-Thinking | 16B MoE with 3B active parameters | Best reasoning model |
| SmolVLM2 | 256M, 500M (our favorite!), 2.2B | Smallest video language model |
| Llama 4 Scout & Maverick | 109B/400B MoE with 17B active parameters | Loooooong context |
| Molmo | 1B, 7B, 72B, and MoE with 1B active parameters | Fully open model with localization capabilities on top |